Series Foreword


The goal of building systems that can adapt to their environments and learn from 
their experience has attracted researchers from many fields, including computer 
science, engineering, mathematics, physics, neuroscience, and cognitive science. 
Out of this research has come a wide variety of learning techniques that have 
the potential to transform many scientific and industrial fields. Recently, several 
research communities have converged on a common set of issues surrounding 
supervised, unsupervised, and reinforcement learning problems. The MIT Press 
series on Adaptive Computation and Machine Learning seeks to unify the many 
diverse strands of machine learning research and to foster high quality research 
and innovative applications. 

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and 
Beyond is an excellent illustration of this convergence of ideas from many fields. 
The development of kernel-based learning methods has resulted from a combi- 
nation of machine learning theory, optimization algorithms from operations re- 
search, and kernel techniques from mathematical analysis. These three ideas have 
spread far beyond the original support-vector machine algorithm: Virtually every
learning algorithm has been redesigned to exploit the power of kernel methods.
Bernhard Schölkopf and Alexander Smola have written a comprehensive, yet
accessible, account of these developments. This volume includes all of the math- 
ematical and algorithmic background needed not only to obtain a basic under- 
standing of the material but to master it. Students and researchers who study this 
book will be able to apply kernel methods in creative ways to solve a wide range 
of problems in science and engineering. 


Thomas Dietterich 


Preface 


One of the most fortunate situations a scientist can encounter is to enter a field in 
its infancy. There is a large choice of topics to work on, and many of the issues 
are conceptual rather than merely technical. Over the last seven years, we have 
had the privilege to be in this position with regard to the field of Support Vector 
Machines (SVMs). We began working on our respective doctoral dissertations in 
1994 and 1996. Upon completion, we decided to combine our efforts and write 
a book about SVMs. Since then, the field has developed impressively, and has to 
an extent been transformed. We set up a website that quickly became the central 
repository for the new community, and a number of workshops were organized 
by various researchers. The scope of the field has now widened significantly, both 
in terms of new algorithms, such as kernel methods different to SVMs, and in 
terms of a deeper theoretical understanding being gained. It has become clear 
that kernel methods provide a framework for tackling some rather profound 
issues in machine learning theory. At the same time, successful applications have 
demonstrated that SVMs not only have a more solid foundation than artificial 
neural networks, but are able to serve as a replacement for neural networks that 
perform as well or better, in a wide variety of fields. Standard neural network and 
pattern recognition textbooks have now started including chapters on SVMs and 
kernel PCA (for instance, [235, 153]). 

While these developments took place, we were trying to strike a balance be- 
tween pursuing exciting new research, and making progress with the slowly grow- 
ing manuscript of this book. In the two and a half years that we worked on the 
book, we faced a number of lessons that we suspect everyone writing a scientific 
monograph — or any other book — will encounter. First, writing a book is more 
work than you think, even with two authors sharing the work in equal parts. Sec- 
ond, our book got longer than planned. Once we exceeded the initially planned 
length of 500 pages, we got worried. In fact, the manuscript kept growing even 
after we stopped writing new chapters, and began polishing things and incorpo- 
rating corrections suggested by colleagues. This was mainly due to the fact that the 
book deals with a fascinating new area, and researchers keep adding fresh material 
to the body of knowledge. We learned that there is no asymptotic regime in writ- 
ing such a book — if one does not stop, it will grow beyond any bound — unless 
one starts cutting. We therefore had to take painful decisions to leave out material 
that we originally thought should be in the book. Sadly, and this is the third point, 
the book thus contains less material than originally planned, especially on the
subject of theoretical developments. We sincerely apologize to all researchers who feel
that their contributions should have been included — the book is certainly biased 
towards our own work, and does not provide a fully comprehensive overview of 
the field. We did, however, aim to provide all the necessary concepts and ideas to 
enable a reader equipped with some basic mathematical knowledge to enter the 
engaging world of machine learning, using theoretically well-founded kernel al- 
gorithms, and to understand and apply the powerful algorithms that have been 
developed over the last few years. 

The book is divided into three logical parts. Each part consists of a brief intro- 
duction and a number of technical chapters. In addition, we include two appen- 
dices containing addenda, technical details, and mathematical prerequisites. Each 
chapter begins with a short discussion outlining the contents and prerequisites; for 
some of the longer chapters, we include a graph that sketches the logical structure 
and dependencies between the sections. At the end of most chapters, we include a 
set of problems, ranging from simple exercises (marked by •) to hard ones (•••); in
addition, we describe open problems and questions for future research (◦◦◦).¹ The
latter often represent worthwhile projects for a research publication, or even a the- 
sis. References are also included in some of the problems. These references contain 
the solutions to the associated problems, or at least significant parts thereof. 

The overall structure of the book is perhaps somewhat unusual. Rather than 
presenting a logical progression of chapters building upon each other, we occa- 
sionally touch on a subject briefly, only to revisit it later in more detail. For readers 
who are used to reading scientific monographs and textbooks from cover to cover, 
this will amount to some redundancy. We hope, however, that some readers, who 
are more selective in their reading habits (or less generous with their time), and 
only look at those chapters that they are interested in, will benefit. Indeed, no- 
body is expected to read every chapter. Some chapters are fairly technical, and 
cover material included for reasons of completeness. Other chapters, which are 
more relevant to the central subjects of the book, are kept simpler, and should be 
accessible to undergraduate students. 

In a way, this book thus contains several books in one. For instance, the first 
chapter can be read as a standalone “executive summary” of Support Vector and 
kernel methods. This chapter should also provide a fast entry point for practition- 
ers. Someone interested in applying SVMs to a pattern recognition problem might 
want to read Chapters 1 and 7 only. A reader thinking of building their own SVM 
implementation could additionally read Chapter 10, and parts of Chapter 6. Those 
who would like to get actively involved in research aspects of kernel methods, for 
example by “kernelizing” a new algorithm, should probably read at least Chapters
1 and 2. A one-semester undergraduate course on learning with kernels could
include the material of Chapters 1, 2.1-2.3, 3.1-3.2, 5.1-5.2, 6.1-6.3, 7. If there is more
time, one of Chapters 14, 16, or 17 can be added, or 4.1-4.2. A graduate course
could additionally deal with the more advanced parts of Chapters 3, 4, and 5. The
remaining chapters provide ample material for specialized courses and seminars.

1. We suggest that authors post their solutions on the book website
www.learning-with-kernels.org.

As a general time-saving rule, we recommend reading the first chapter and then 
jumping directly to the chapter of particular interest to the reader. Chances are 
that this will lead to a chapter that contains references to the earlier ones, which 
can then be followed as desired. We hope that this way, readers will inadvertently 
be tempted to venture into some of the less frequented chapters and research areas. 
Explore this book; there is a lot to find, and much more is yet to be discovered in 
the field of learning with kernels. 

We conclude the preface by thanking those who assisted us in the prepara- 
tion of the book. Our first thanks go to our first readers. Chris Burges, Arthur 
Gretton, and Bob Williamson have read through various versions of the book, 
and made numerous suggestions that corrected or improved the material. A 
number of other researchers have proofread various chapters. We would like to 
thank Matt Beal, Daniel Berger, Olivier Bousquet, Ben Bradshaw, Nicolò Cesa-Bianchi,
Olivier Chapelle, Dennis DeCoste, André Elisseeff, Anita Faul, Arnulf
Graf, Isabelle Guyon, Ralf Herbrich, Simon Hill, Dominik Janzing, Michael Jordan,
Sathiya Keerthi, Neil Lawrence, Ben O’Loghlin, Ulrike von Luxburg, Davide Mattera,
Sebastian Mika, Natasa Milic-Frayling, Marta Milo, Klaus Müller, Dave Musicant,
Fernando Pérez Cruz, Ingo Steinwart, Mike Tipping, and Chris Williams.

In addition, a large number of people have contributed to this book in one 
way or another, be it by sharing their insights with us in discussions, or by col- 
laborating with us on some of the topics covered in the book. In many places, 
this strongly influenced the presentation of the material. We would like to thank 
Dimitris Achlioptas, Luis Almeida, Shun-Ichi Amari, Peter Bartlett, Jonathan Bax- 
ter, Tony Bell, Shai Ben-David, Kristin Bennett, Matthias Bethge, Chris Bishop, 
Andrew Blake, Volker Blanz, Léon Bottou, Paul Bradley, Chris Burges,
Heinrich Bülthoff, Olivier Chapelle, Nello Cristianini, Corinna Cortes, Cameron Dawson,
Tom Dietterich, André Elisseeff, Oscar de Feo, Federico Girosi, Thore Graepel,
Isabelle Guyon, Patrick Haffner, Stefan Harmeling, Paul Hayton, Markus Hegland,
Ralf Herbrich, Tommi Jaakkola, Michael Jordan, Jyrki Kivinen, Yann LeCun,
Chi-Jen Lin, Gabor Lugosi, Olvi Mangasarian, Laurent Massoulie, Sebastian Mika,
Sayan Mukherjee, Klaus Müller, Noboru Murata, Nuria Oliver, John Platt, Tomaso
Poggio, Gunnar Rätsch, Sami Romdhani, Rainer von Sachs, Christoph Schnörr,
Matthias Seeger, John Shawe-Taylor, Kristy Sim, Patrice Simard, Stephen Smale, 
Sara Solla, Lionel Tarassenko, Lily Tian, Mike Tipping, Alexander Tsybakov, Lou 
van den Dries, Santosh Venkatesh, Thomas Vetter, Chris Watkins, Jason Weston, 
Chris Williams, Bob Williamson, Andreas Ziehe, Alex Zien, and Tong Zhang. 

Next, we would like to extend our thanks to the research institutes that allowed 
us to pursue our research interests and to dedicate the time necessary for writing 
the present book; these are AT&T / Bell Laboratories (Holmdel), the Australian 
National University (Canberra), Biowulf Technologies (New York), GMD FIRST 
(Berlin), the Max-Planck-Institute for Biological Cybernetics (Tübingen), and
Microsoft Research (Cambridge). We are grateful to Doug Sery from MIT Press for
continuing support and encouragement during the writing of this book. We are, 
moreover, indebted to funding from various sources; specifically, from the Studi- 
enstiftung des deutschen Volkes, the Deutsche Forschungsgemeinschaft, the Aus- 
tralian Research Council, and the European Union. 

Finally, special thanks go to Vladimir Vapnik, who introduced us to the fasci- 
nating world of statistical learning theory. 


... the story of the sheep dog who was herding his sheep, and serendipitously 
invented both large margin classification and Sheep Vectors... 


Illustration by Ana Martín Larrañaga 
A Tutorial Introduction


This chapter describes the central ideas of Support Vector (SV) learning in a 
nutshell. Its goal is to provide an overview of the basic concepts. 

One such concept is that of a kernel. Rather than going immediately into math- 
ematical detail, we introduce kernels informally as similarity measures that arise 
from a particular representation of patterns (Section 1.1), and describe a simple 
kernel algorithm for pattern recognition (Section 1.2). Following this, we report 
some basic insights from statistical learning theory, the mathematical theory that 
underlies SV learning (Section 1.3). Finally, we briefly review some of the main 
kernel algorithms, namely Support Vector Machines (SVMs) (Sections 1.4 to 1.6) 
and kernel principal component analysis (Section 1.7). 

We have aimed to keep this introductory chapter as basic as possible, whilst 
giving a fairly comprehensive overview of the main ideas that will be discussed in 
the present book. After reading it, readers should be able to place all the remaining 
material in the book in context and judge which of the following chapters is of 
particular interest to them. 

As a consequence of this aim, most of the claims in the chapter are not proven. 
Abundant references to later chapters will enable the interested reader to fill in the 
gaps at a later stage, without losing sight of the main ideas described presently. 


1.1 Data Representation and Similarity 


Training Data 


One of the fundamental problems of learning theory is the following: suppose we 
are given two classes of objects. We are then faced with a new object, and we have 
to assign it to one of the two classes. This problem can be formalized as follows: 
we are given empirical data 


\[ (x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}. \tag{1.1} \]


Here, X is some nonempty set from which the patterns x_i (sometimes called cases,
inputs, instances, or observations) are taken, usually referred to as the domain; the y_i
are called labels, targets, outputs or sometimes also observations.¹ Note that there are
only two classes of patterns. For the sake of mathematical convenience, they are
labelled by +1 and −1, respectively. This is a particularly simple situation, referred
to as (binary) pattern recognition or (binary) classification.

1. Note that we use the term pattern to refer to individual observations. A (smaller) part of
the existing literature reserves the term for a generic prototype which underlies the data. The

It should be emphasized that the patterns could be just about anything, and we 
have made no assumptions on X other than it being a set. For instance, the task 
might be to categorize sheep into two classes, in which case the patterns x_i would
simply be sheep. 

In order to study the problem of learning, however, we need an additional type 
of structure. In learning, we want to be able to generalize to unseen data points. In 
the case of pattern recognition, this means that given some new pattern x ∈ X, we
want to predict the corresponding y ∈ {±1}.² By this we mean, loosely speaking,
that we choose y such that (x, y) is in some sense similar to the training examples
(1.1). To this end, we need notions of similarity in X and in {±1}.

Characterizing the similarity of the outputs {±1} is easy: in binary classification,
only two situations can occur: two labels can either be identical or different. The 
choice of the similarity measure for the inputs, on the other hand, is a deep 
question that lies at the core of the field of machine learning. 

Let us consider a similarity measure of the form 


\[ k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{1.2} \]


that is, a function that, given two patterns x and x', returns a real number
characterizing their similarity. Unless stated otherwise, we will assume that k is
symmetric, that is, k(x, x') = k(x', x) for all x, x' ∈ X. For reasons that will become clear later
(cf. Remark 2.16), the function k is called a kernel [359, 4, 42, 62, 223].

General similarity measures of this form are rather difficult to study. Let us 
therefore start from a particularly simple case, and generalize it subsequently. A 
simple type of similarity measure that is of particular mathematical appeal is a 
dot product. For instance, given two vectors x, x' ∈ R^N, the canonical dot product is
defined as 


\[ \langle x, x' \rangle := \sum_{i=1}^{N} [x]_i [x']_i. \tag{1.3} \]


Here, [x]_i denotes the ith entry of x.

Note that the dot product is also referred to as inner product or scalar product, and 
sometimes denoted with round brackets and a dot, as (x · x'); this is where the
“dot” in the name comes from. In Section B.2, we give a general definition of dot 
products. Usually, however, it is sufficient to think of dot products as (1.3). 


latter is probably closer to the original meaning of the term, however we decided to stick 
with the present usage, which is more common in the field of machine learning. 
2. Doing this for every x ∈ X amounts to estimating a function f : X → {±1}.

The geometric interpretation of the canonical dot product is that it computes the 
cosine of the angle between the vectors x and x’, provided they are normalized to 
length 1. Moreover, it allows computation of the length (or norm) of a vector x as 


\[ \|x\| = \sqrt{\langle x, x \rangle}. \tag{1.4} \]


Likewise, the distance between two vectors is computed as the length of the 
difference vector. Therefore, being able to compute dot products amounts to being 
able to carry out all geometric constructions that can be formulated in terms of 
angles, lengths and distances. 
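The following minimal Python sketch illustrates the point for vectors in R^N: length, distance and the cosine of the enclosed angle are all obtained from the canonical dot product (1.3) alone. The function names and the example vectors are illustrative choices, not notation from the text.

```python
import math

def dot(x, y):
    """Canonical dot product (1.3) of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Length of x, as in (1.4)."""
    return math.sqrt(dot(x, x))

def distance(x, y):
    """Length of the difference vector, written using dot products only."""
    return math.sqrt(dot(x, x) - 2.0 * dot(x, y) + dot(y, y))

def cosine(x, y):
    """Cosine of the enclosed angle (both vectors must be nonzero)."""
    return dot(x, y) / (norm(x) * norm(y))

x, y = [3.0, 4.0], [4.0, 3.0]
print(norm(x), distance(x, y), cosine(x, y))
```

Writing the distance as ⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩ under the square root matters later: the same expression can be evaluated whenever dot products are available, even if the vectors themselves are never formed explicitly.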

Note, however, that the dot product approach is not really sufficiently general 
to deal with many interesting problems. 


• First, we have deliberately not made the assumption that the patterns actually
exist in a dot product space. So far, they could be any kind of object. In order to
be able to use a dot product as a similarity measure, we therefore first need to
represent the patterns as vectors in some dot product space H (which need not
coincide with R^N). To this end, we use a map

\[ \Phi : \mathcal{X} \to \mathcal{H}, \qquad x \mapsto \mathbf{x} := \Phi(x). \tag{1.5} \]

• Second, even if the original patterns exist in a dot product space, we may still
want to consider more general similarity measures obtained by applying a map
(1.5). In that case, Φ will typically be a nonlinear map. An example that we will
consider in Chapter 2 is a map which computes products of entries of the input
patterns.


In both the above cases, the space H is called a feature space. Note that we have 
used a bold face x to denote the vectorial representation of x in the feature space. 
We will follow this convention throughout the book. 

To summarize, embedding the data into H via Φ has three benefits:


1. It lets us define a similarity measure from the dot product in H, 
\[ k(x, x') := \langle \mathbf{x}, \mathbf{x}' \rangle = \langle \Phi(x), \Phi(x') \rangle. \tag{1.6} \]


2. It allows us to deal with the patterns geometrically, and thus lets us study 
learning algorithms using linear algebra and analytic geometry. 


3. The freedom to choose the mapping Φ will enable us to design a large variety
of similarity measures and learning algorithms. This also applies to the situation
where the inputs x_i already exist in a dot product space. In that case, we might
directly use the dot product as a similarity measure. However, nothing prevents us
from first applying a possibly nonlinear map Φ to change the representation into
one that is more suitable for a given problem. This will be elaborated in Chapter 2, 
where the theory of kernels is developed in more detail. 
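As a small illustration of (1.5) and (1.6), the sketch below uses one possible feature map that computes all ordered products of two input entries. This particular map `phi` is a hypothetical example chosen for illustration; for it, the induced kernel happens to equal the squared canonical dot product of the inputs.

```python
from itertools import product

def dot(x, y):
    """Canonical dot product (1.3)."""
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Hypothetical feature map: all ordered products of two entries of x."""
    return [a * b for a, b in product(x, repeat=2)]

def k(x, xp):
    """Kernel (1.6) induced by phi: the dot product in feature space."""
    return dot(phi(x), phi(xp))

x, xp = [1.0, 2.0], [3.0, -1.0]
# For this phi, the feature-space dot product equals <x, x'> squared.
print(k(x, xp), dot(x, xp) ** 2)
```

The similarity measure can therefore be evaluated directly on the inputs, without building the feature vectors, which is the viewpoint developed in Chapter 2.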
1.2 A Simple Pattern Recognition Algorithm


Decision 
Function 


We are now in the position to describe a pattern recognition learning algorithm 
that is arguably one of the simplest possible. We make use of the structure intro- 
duced in the previous section; that is, we assume that our data are embedded into 
a dot product space H.³ Using the dot product, we can measure distances in this
space. The basic idea of the algorithm is to assign a previously unseen pattern to
the class with the closer mean.

We thus begin by computing the means of the two classes in feature space,


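\[ \mathbf{c}_+ = \frac{1}{m_+} \sum_{\{i \mid y_i = +1\}} \mathbf{x}_i, \tag{1.7} \]

\[ \mathbf{c}_- = \frac{1}{m_-} \sum_{\{i \mid y_i = -1\}} \mathbf{x}_i, \tag{1.8} \]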


where m_+ and m_- are the number of examples with positive and negative labels,
respectively. We assume that both classes are non-empty, thus m_+, m_- > 0. We
assign a new point x to the class whose mean is closest (Figure 1.1). This geometric
construction can be formulated in terms of the dot product ⟨·, ·⟩. Half way between
c_+ and c_- lies the point c := (c_+ + c_-)/2. We compute the class of x by checking
whether the vector x − c connecting c to x encloses an angle smaller than π/2 with
the vector w := c_+ − c_- connecting the class means. This leads to


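\begin{align*}
y &= \operatorname{sgn} \langle (\mathbf{x} - \mathbf{c}), \mathbf{w} \rangle \\
  &= \operatorname{sgn} \langle (\mathbf{x} - (\mathbf{c}_+ + \mathbf{c}_-)/2), (\mathbf{c}_+ - \mathbf{c}_-) \rangle \\
  &= \operatorname{sgn} ( \langle \mathbf{x}, \mathbf{c}_+ \rangle - \langle \mathbf{x}, \mathbf{c}_- \rangle + b ). \tag{1.9}
\end{align*}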


Here, we have defined the offset

\[ b := \frac{1}{2} \left( \|\mathbf{c}_-\|^2 - \|\mathbf{c}_+\|^2 \right), \tag{1.10} \]

with the norm ‖x‖ := √⟨x, x⟩. If the class means have the same distance to the
origin, then b will vanish.
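For readers who want the intermediate step between the second and third lines of (1.9), the expansion can be sketched as follows. It uses only bilinearity and symmetry of the dot product; the cross terms in the second line cancel by symmetry.

```latex
\begin{align*}
\langle \mathbf{x} - (\mathbf{c}_+ + \mathbf{c}_-)/2,\; \mathbf{c}_+ - \mathbf{c}_- \rangle
  &= \langle \mathbf{x}, \mathbf{c}_+ \rangle - \langle \mathbf{x}, \mathbf{c}_- \rangle
     - \tfrac{1}{2} \langle \mathbf{c}_+ + \mathbf{c}_-,\; \mathbf{c}_+ - \mathbf{c}_- \rangle, \\
\langle \mathbf{c}_+ + \mathbf{c}_-,\; \mathbf{c}_+ - \mathbf{c}_- \rangle
  &= \|\mathbf{c}_+\|^2 - \langle \mathbf{c}_+, \mathbf{c}_- \rangle
     + \langle \mathbf{c}_-, \mathbf{c}_+ \rangle - \|\mathbf{c}_-\|^2
   = \|\mathbf{c}_+\|^2 - \|\mathbf{c}_-\|^2 .
\end{align*}
```

Substituting the second identity into the first leaves the remainder ½(‖c_-‖² − ‖c_+‖²), which is exactly the offset b of (1.10).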

Note that (1.9) induces a decision boundary which has the form of a hyperplane 
(Figure 1.1); that is, a set of points that satisfy a constraint expressible as a linear 
equation. 

It is instructive to rewrite (1.9) in terms of the input patterns x_i, using the kernel
k to compute the dot products. Note, however, that (1.6) only tells us how to
compute the dot products between the vectorial representations Φ(x_i) of inputs x_i. We
therefore need to express the vectors c_+, c_- and w in terms of Φ(x_1), ..., Φ(x_m).

To this end, substitute (1.7) and (1.8) into (1.9) to get the decision function 


3. For the definition of a dot product space, see Section B.2. 
Figure 1.1 A simple geometric classification algorithm: given two classes of points
(depicted by 'o' and '+'), compute their means c_+, c_- and assign a test pattern x to the one
whose mean is closer. This can be done by looking at the dot product between x − c (where
c = (c_+ + c_-)/2) and w := c_+ − c_-, which changes sign as the enclosed angle passes through
π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line)
orthogonal to w.


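\begin{align*}
y &= \operatorname{sgn} \left( \frac{1}{m_+} \sum_{\{i \mid y_i = +1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle - \frac{1}{m_-} \sum_{\{i \mid y_i = -1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle + b \right) \\
  &= \operatorname{sgn} \left( \frac{1}{m_+} \sum_{\{i \mid y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i \mid y_i = -1\}} k(x, x_i) + b \right). \tag{1.11}
\end{align*}

Similarly, the offset becomes

\[ b := \frac{1}{2} \left( \frac{1}{m_-^2} \sum_{\{(i,j) \mid y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j) \mid y_i = y_j = +1\}} k(x_i, x_j) \right). \tag{1.12} \]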
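The decision function (1.11) with offset (1.12) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the Gaussian kernel, the toy data, and the convention of returning +1 when the argument of sgn is exactly zero are all assumptions made here.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    """Illustrative kernel choice (an assumption, not prescribed by the text)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mean_classifier(X, y, k):
    """Return a decision function implementing (1.11) with offset (1.12)."""
    pos = [x for x, label in zip(X, y) if label == +1]
    neg = [x for x, label in zip(X, y) if label == -1]
    if not pos or not neg:
        raise ValueError("both classes must be non-empty")
    m_pos, m_neg = len(pos), len(neg)
    # Squared norms of the class means, expressed through the kernel.
    norm_neg = sum(k(a, b) for a in neg for b in neg) / m_neg ** 2
    norm_pos = sum(k(a, b) for a in pos for b in pos) / m_pos ** 2
    b = 0.5 * (norm_neg - norm_pos)  # offset (1.12)

    def decide(x):
        s_pos = sum(k(x, xi) for xi in pos) / m_pos
        s_neg = sum(k(x, xi) for xi in neg) / m_neg
        return 1 if s_pos - s_neg + b >= 0 else -1  # sgn, with sgn(0) := +1

    return decide

X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]
y = [+1, +1, -1, -1]
f = mean_classifier(X, y, gaussian_kernel)
print(f([0.2, 0.4]), f([3.5, 2.8]))
```

Note that the feature vectors never appear: the classifier touches the data only through evaluations of k, so any symmetric kernel function with the same signature can be passed in.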


Surprisingly, it turns out that this rather simple-minded approach contains a well- 
known statistical classification method as a special case. Assume that the class 
means have the same distance to the origin (hence b = 0, cf. (1.10)), and that k can 
be viewed as a probability density when one of its arguments is fixed. By this we 
mean that it is positive and has unit integral,⁴


\[ \int_{\mathcal{X}} k(x, x') \, dx = 1 \quad \text{for all } x' \in \mathcal{X}. \tag{1.13} \]


In this case, (1.11) takes the form of the so-called Bayes classifier separating the two
classes, subject to the assumption that the two classes of patterns were generated
by sampling from two probability distributions that are correctly estimated by the
Parzen windows estimators of the two class densities,

\[ p_+(x) := \frac{1}{m_+} \sum_{\{i \mid y_i = +1\}} k(x, x_i) \quad \text{and} \quad p_-(x) := \frac{1}{m_-} \sum_{\{i \mid y_i = -1\}} k(x, x_i), \tag{1.14} \]

where x ∈ X.

4. In order to state this assumption, we have to require that we can define an integral on X.


Given some point x, the label is then simply computed by checking which of 
the two values p_+(x) or p_-(x) is larger, which leads directly to (1.11). Note that
this decision is the best we can do if we have no prior information about the 
probabilities of the two classes. 
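A minimal sketch of this rule for one-dimensional inputs is given below. The normalized Gaussian window, its width `sigma`, and the toy samples are assumptions chosen for illustration; comparing the two density estimates is the same as taking the sign of p_+(x) − p_-(x), that is, (1.11) with b = 0.

```python
import math

def gauss_density_kernel(x, xp, sigma=0.5):
    """1D Gaussian kernel normalized to unit integral in x, cf. (1.13)."""
    z = (x - xp) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def parzen(x, points, k):
    """Parzen windows density estimate (1.14) at x from one class's points."""
    return sum(k(x, xi) for xi in points) / len(points)

def classify(x, pos, neg, k=gauss_density_kernel):
    """Pick the class with the larger estimated density (ties go to +1)."""
    return 1 if parzen(x, pos, k) >= parzen(x, neg, k) else -1

pos = [-1.2, -0.8, -1.0]  # toy training inputs with label +1
neg = [0.9, 1.1, 1.4]     # toy training inputs with label -1
print(classify(-0.5, pos, neg), classify(0.7, pos, neg))
```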

The classifier (1.11) is quite close to the type of classifier that this book deals with 
in detail. Both take the form of kernel expansions on the input domain, 


\[ y = \operatorname{sgn} \left( \sum_{i=1}^{m} \alpha_i k(x, x_i) + b \right). \tag{1.15} \]

In both cases, the expansions correspond to a separating hyperplane in a feature 
space. In this sense, the α_i can be considered a dual representation of the
hyperplane’s normal vector [223]. Both classifiers are example-based in the sense that
the kernels are centered on the training patterns; that is, one of the two arguments 
of the kernel is always a training pattern. A test point is classified by comparing it 
to all the training points that appear in (1.15) with a nonzero weight. 

More sophisticated classification techniques, to be discussed in the remainder 
of the book, deviate from (1.11) mainly in the selection of the patterns on which 
the kernels are centered and in the choice of weights α_i that are placed on the
individual kernels in the decision function. It will no longer be the case that all
training patterns appear in the kernel expansion, and the weights of the kernels
in the expansion will no longer be uniform within the classes; recall that in the
current example, cf. (1.11), the weights are either (1/m_+) or (−1/m_-), depending
on the class to which the pattern belongs. 
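To make the relation between (1.11) and (1.15) concrete, the sketch below evaluates a general kernel expansion and then plugs in the uniform class weights. The helper names, the linear kernel and the toy data are illustrative assumptions; patterns with zero weight are skipped, which is how a sparse expansion would save computation.

```python
def kernel_expansion(alphas, points, b, k):
    """Decision function (1.15): sign of a weighted sum of kernels plus offset."""
    def decide(x):
        s = sum(a * k(x, xi) for a, xi in zip(alphas, points) if a != 0.0)
        return 1 if s + b >= 0 else -1  # sgn, with sgn(0) := +1
    return decide

def uniform_alphas(y):
    """Weights reproducing (1.11): 1/m_+ for positives, -1/m_- for negatives.

    Assumes both classes occur in y.
    """
    m_pos = sum(1 for label in y if label == +1)
    m_neg = sum(1 for label in y if label == -1)
    return [1.0 / m_pos if label == +1 else -1.0 / m_neg for label in y]

def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))

X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]
y = [+1, +1, -1, -1]
# Offset (1.12) for this toy data and the linear kernel: 0.5 * (21.25 - 0.25).
f = kernel_expansion(uniform_alphas(y), X, b=10.5, k=linear_kernel)
print(f([0.0, 0.5]), f([3.5, 3.0]))
```

More sophisticated algorithms keep the function `kernel_expansion` unchanged and differ only in how the weights and the offset are chosen.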

In the feature space representation, this statement corresponds to saying that 
we will study normal vectors w of decision hyperplanes that can be represented 
as general linear combinations (i.e., with non-uniform coefficients) of the training 
patterns. For instance, we might want to remove the influence of patterns that are 
very far away from the decision boundary, either since we expect that they will not 
improve the generalization error of the decision function, or since we would like to 
reduce the computational cost of evaluating the decision function (cf. (1.11)). The 
hyperplane will then only depend on a subset of training patterns called Support 
Vectors. 


1.3 Some Insights From Statistical Learning Theory 


With the above example in mind, let us now consider the problem of pattern
recognition in a slightly more formal setting [559, 152, 186]. This will allow us
to indicate the factors affecting the design of “better” algorithms. Rather than just
providing tools to come up with new algorithms, we also want to provide some
insight into how to do it in a promising way.

Figure 1.2 2D toy example of binary classification, solved using three models (the decision
boundaries are shown). The models vary in complexity, ranging from a simple one (left),
which misclassifies a large number of points, to a complex one (right), which “trusts” each
point and comes up with a solution that is consistent with all training points (but may not
work well on new points). As an aside: the plots were generated using the so-called
soft-margin SVM to be explained in Chapter 7; cf. also Figure 7.10.

In two-class pattern recognition, we seek to infer a function 


\[ f : \mathcal{X} \to \{\pm 1\} \tag{1.16} \]


from input-output training data (1.1). The training data are sometimes also called 
the sample. 

Figure 1.2 shows a simple 2D toy example of a pattern recognition problem. 
The task is to separate the solid dots from the circles by finding a function which 
takes the value 1 on the dots and −1 on the circles. Note that instead of plotting
this function, we may plot the boundaries where it switches between 1 and −1.
In the rightmost plot, we see a classification function which correctly separates 
all training points. From this picture, however, it is unclear whether the same 
would hold true for test points which stem from the same underlying regularity. 
For instance, what should happen to a test point which lies close to one of the 
two “outliers,” sitting amidst points of the opposite class? Maybe the outliers 
should not be allowed to claim their own custom-made regions of the decision 
function. To avoid this, we could try to go for a simpler model which disregards 
these points. The leftmost picture shows an almost linear separation of the classes. 
This separation, however, not only misclassifies the above two outliers, but also 
a number of “easy” points which are so close to the decision boundary that 
the classifier really should be able to get them right. Finally, the central picture 
represents a compromise, by using a model with an intermediate complexity, 
which gets most points right, without putting too much trust in any individual 
point. 

The goal of statistical learning theory is to place these intuitive arguments in
a mathematical framework. To this end, it studies mathematical properties of
learning machines. These properties are usually properties of the function class
that the learning machine can implement.

Figure 1.3 A 1D classification problem, with a training set of three points (marked by
circles), and three test inputs (marked on the x-axis). Classification is performed by
thresholding real-valued functions g(x) according to sgn (f(x)). Note that both functions (dotted line,
and solid line) perfectly explain the training data, but they give opposite predictions on the
test inputs. Lacking any further information, the training data alone give us no means to
tell which of the two functions is to be preferred.

We assume that the data are generated independently from some unknown (but 
fixed) probability distribution P(x, y).⁵ This is a standard assumption in learning
theory; data generated this way is commonly referred to as iid (independent and
identically distributed). Our goal is to find a function f that will correctly classify
unseen examples (x, y), so that f(x) = y for examples (x, y) that are also generated
from P(x, y). Correctness of the classification is measured by means of the zero-one
loss function c(x, y, f(x)) := ½|f(x) − y|. Note that the loss is 0 if (x, y) is classified
correctly, and 1 otherwise. 

If we put no restriction on the set of functions from which we choose our 
estimated f, however, then even a function that does very well on the training 
data, e.g., by satisfying f(x_i) = y_i for all i = 1, ..., m, might not generalize well
to unseen examples. To see this, note that for each function f and any test set
(x̄_1, ȳ_1), ..., (x̄_m̄, ȳ_m̄) ∈ X × {±1}, satisfying {x̄_1, ..., x̄_m̄} ∩ {x_1, ..., x_m} = ∅, there
exists another function f* such that f*(x_i) = f(x_i) for all i = 1, ..., m, yet f*(x̄_i) ≠
f(x̄_i) for all i = 1, ..., m̄ (cf. Figure 1.3). As we are only given the training data, we
have no means of selecting which of the two functions (and hence which of the two 
different sets of test label predictions) is preferable. We conclude that minimizing 
only the (average) training error (or empirical risk), 


\[ R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|, \tag{1.17} \]


does not imply a small test error (called risk), averaged over test examples drawn
from the underlying distribution P(x, y),

\[ R[f] = \int \frac{1}{2} |f(x) - y| \, dP(x, y). \tag{1.18} \]

5. For a definition of a probability distribution, see Section B.1.1.
6. We mostly use the term example to denote a pair consisting of a training pattern x and
the corresponding target y.


The risk can be defined for any loss function, provided the integral exists. For the 
present zero-one loss function, the risk equals the probability of misclassification.⁷
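The gap between training error and test error can be illustrated with a small Python sketch. The sample values and the two functions are made up for this illustration; the average over the test sample is only a finite-sample stand-in for the risk (1.18), which would require the unknown distribution P(x, y).

```python
def zero_one_loss(fx, y):
    """Zero-one loss (1/2)|f(x) - y| for labels in {+1, -1}."""
    return 0.5 * abs(fx - y)

def empirical_risk(f, X, y):
    """Average loss over a sample, as in (1.17)."""
    return sum(zero_one_loss(f(x), yi) for x, yi in zip(X, y)) / len(X)

# Toy 1D training sample and a disjoint test sample (values are made up).
X_train, y_train = [-2.0, -1.0, 1.5], [-1, -1, +1]
X_test, y_test = [-0.5, 0.5, 2.5], [-1, +1, +1]

def f(x):
    return 1 if x > 0 else -1

def f_star(x):
    """Agrees with f on the training inputs, disagrees on every test input."""
    return -f(x) if x in X_test else f(x)

print(empirical_risk(f, X_train, y_train), empirical_risk(f_star, X_train, y_train))
print(empirical_risk(f, X_test, y_test), empirical_risk(f_star, X_test, y_test))
```

Both functions have zero training error, so the training data alone cannot distinguish them, yet their test errors are as different as they can be.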

Statistical learning theory (Chapter 5, [570, 559, 561, 136, 562, 14]), or VC 
(Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the set of 
functions from which f is chosen to one that has a capacity suitable for the amount 
of available training data. VC theory provides bounds on the test error. The min- 
imization of these bounds, which depend on both the empirical risk and the ca- 
pacity of the function class, leads to the principle of structural risk minimization 
[559]. 

The best-known capacity concept of VC theory is the VC dimension, defined as 
follows: each function of the class separates the patterns in a certain way and thus 
induces a certain labelling of the patterns. Since the labels are in $\{\pm 1\}$, there are
at most $2^m$ different labellings for m patterns. A very rich function class might be
able to realize all $2^m$ separations, in which case it is said to shatter the m points.
However, a given class of functions might not be sufficiently rich to shatter the
m points. The VC dimension is defined as the largest m such that there exists a
set of m points which the class can shatter, and $\infty$ if no such m exists. It can be
thought of as a one-number summary of a learning machine’s capacity (for an 
example, see Figure 1.4). As such, it is necessarily somewhat crude. More accurate 
capacity measures are the annealed VC entropy or the growth function. These are 
usually considered to be harder to evaluate, but they play a fundamental role in 
the conceptual part of VC theory. Another interesting capacity measure, which can 
be thought of as a scale-sensitive version of the VC dimension, is the fat shattering 
dimension [286, 6]. For further details, cf. Chapters 5 and 12. 
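
Shattering can be checked by brute force for very simple function classes. The sketch below uses a class chosen purely for illustration (it is not taken from the text): oriented thresholds on the real line, $f(x) = s \cdot \mathrm{sgn}(x - t)$ with $s \in \{\pm 1\}$. It enumerates the labellings this class induces on a point set and compares their number with $2^m$.

```python
from itertools import product


def realizable_labellings(points):
    """Labellings of distinct 1-D points induced by f(x) = s * sgn(x - t)."""
    xs = sorted(points)
    # One threshold below all points, one between each neighbouring pair, one above.
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    labellings = set()
    for t, s in product(thresholds, (-1, 1)):
        labellings.add(tuple(s * (1 if x > t else -1) for x in points))
    return labellings


def is_shattered(points):
    """True if all 2^m labellings of the m points are realized."""
    return len(realizable_labellings(points)) == 2 ** len(points)


two_points_shattered = is_shattered([0.0, 1.0])
three_points_shattered = is_shattered([0.0, 1.0, 2.0])
```

Two points are shattered, whereas the three points are not: a labelling such as (+1, -1, +1) would need two sign changes along the line. Since the same argument applies to any three distinct points, this toy class has VC dimension 2.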

Whilst it will be difficult for the non-expert to appreciate the results of VC theory 
in this chapter, we will nevertheless briefly describe an example of a VC bound: 


7. The risk-based approach to machine learning has its roots in statistical decision theory 
[582, 166, 43]. In that context, f(x) is thought of as an action, and the loss function measures 
the loss incurred by taking action f(x) upon observing x when the true output (state of 
nature) is y. 

Like many fields of statistics, decision theory comes in two flavors. The present approach 
is a frequentist one. It considers the risk as a function of the distribution P and the decision 
function f. The Bayesian approach considers parametrized families $P_\Theta$ to model the distribution.
Given a prior over $\Theta$ (which need not in general be a finite-dimensional vector),
the Bayes risk of a decision function f is the expected frequentist risk, where the expectation
is taken over the prior. Minimizing the Bayes risk (over decision functions) then leads to
a Bayes decision function. Bayesians thus act as if the parameter $\Theta$ were actually a random
variable whose distribution is known. Frequentists, who do not make this (somewhat bold) 
assumption, have to resort to other strategies for picking a decision function. Examples 
thereof are considerations like invariance and unbiasedness, both used to restrict the class 
of decision rules, and the minimax principle. A decision function is said to be minimax if 
it minimizes (over all decision functions) the maximal (over all distributions) risk. For a 
discussion of the relationship of these issues to VC theory, see Problem 5.9. 
Figure 1.4 A simple VC dimension example. There are $2^3 = 8$ ways of assigning 3 points
to two classes. For the displayed points in $\mathbb{R}^2$, all 8 possibilities can be realized using
separating hyperplanes; in other words, the function class can shatter 3 points. This would
not work if we were given 4 points, no matter how we placed them. Therefore, the VC
dimension of the class of separating hyperplanes in $\mathbb{R}^2$ is 3.


if h < m is the VC dimension of the class of functions that the learning machine 
can implement, then for all functions of that class, independent of the underlying 
distribution P generating the data, with a probability of at least $1 - \delta$ over the
drawing of the training sample,⁸ the bound


$$R[f] \le R_{\mathrm{emp}}[f] + \phi(h, m, \delta) \qquad (1.19)$$


holds, where the confidence term (or capacity term) $\phi$ is defined as


$$\phi(h, m, \delta) = \sqrt{\frac{1}{m}\left(h\left(\ln\frac{2m}{h} + 1\right) + \ln\frac{4}{\delta}\right)}. \qquad (1.20)$$
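
To get a feeling for the size of the confidence term, the following sketch evaluates (1.20) for a small and a large VC dimension at a fixed sample size. The function names and the specific numbers (m = 10000, δ = 0.05) are illustrative assumptions.

```python
import math


def vc_confidence(h, m, delta):
    """Confidence term phi(h, m, delta) of (1.20); needs 0 < h < m and 0 < delta < 1."""
    if not (0 < h < m) or not (0 < delta < 1):
        raise ValueError("need 0 < h < m and 0 < delta < 1")
    return math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)


def vc_bound(empirical_risk, h, m, delta):
    """Right hand side of (1.19)."""
    return empirical_risk + vc_confidence(h, m, delta)


small_capacity = vc_confidence(h=10, m=10000, delta=0.05)
large_capacity = vc_confidence(h=5000, m=10000, delta=0.05)
```

For h = 10 the term stays below 0.1, so the bound is informative. For h = 5000 it exceeds 1, and since the risk of a zero-one loss never exceeds 1, the bound then holds trivially and says nothing.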


The bound (1.19) merits further explanation. Suppose we wanted to learn a 
“dependency” where patterns and labels are statistically independent, P(x, y) = 
P(x)P(y). In that case, the pattern x contains no information about the label y. If, 
moreover, the two classes $+1$ and $-1$ are equally likely, there is no way of making
a good guess about the label of a test pattern. 

Nevertheless, given a training set of finite size, we can always come up with 
a learning machine which achieves zero training error (provided we have no 
examples contradicting each other, i.e., whenever two patterns are identical, then 
they must come with the same label). To reproduce the random labellings by 
correctly separating all training examples, however, this machine will necessarily 
require a large VC dimension h. Therefore, the confidence term (1.20), which 
increases monotonically with h, will be large, and the bound (1.19) will show 


8. Recall that each training example is generated from P(x, y), and thus the training data 
are subject to randomness. 
that the small training error does not guarantee a small test error. This illustrates
how the bound can apply independent of assumptions about the underlying 
distribution P(x, y): it always holds (provided that h < m), but it does not always 
make a nontrivial prediction. In order to get nontrivial predictions from (1.19), 
the function class must be restricted such that its capacity (e.g., VC dimension) 
is small enough (in relation to the available amount of data). At the same time, 
the class should be large enough to provide functions that are able to model the 
dependencies hidden in P(x, y). The choice of the set of functions is thus crucial for 
learning from data. In the next section, we take a closer look at a class of functions 
which is particularly interesting for pattern recognition problems. 


1.4 Hyperplane Classifiers 




In the present section, we shall describe a hyperplane learning algorithm that can 
be performed in a dot product space (such as the feature space that we introduced 
earlier). As described in the previous section, to design learning algorithms whose 
statistical effectiveness can be controlled, one needs to come up with a class of 
functions whose capacity can be computed. Vapnik et al. [573, 566, 570] considered 
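the class of hyperplanes in some dot product space $\mathcal{H}$,


$$\langle w, x \rangle + b = 0 \quad \text{where } w \in \mathcal{H},\ b \in \mathbb{R}, \qquad (1.21)$$

corresponding to decision functions

$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b), \qquad (1.22)$$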


and proposed a learning algorithm for problems which are separable by hyper- 
planes (sometimes said to be linearly separable), termed the Generalized Portrait, for 
constructing f from empirical data. It is based on two facts. First (see Chapter 7), 
among all hyperplanes separating the data, there exists a unique optimal hyper- 
plane, distinguished by the maximum margin of separation between any training 
point and the hyperplane. It is the solution of 

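$$\underset{w \in \mathcal{H},\, b \in \mathbb{R}}{\text{maximize}} \ \min \{ \|x - x_i\| \mid x \in \mathcal{H},\ \langle w, x \rangle + b = 0,\ i = 1, \ldots, m \}. \qquad (1.23)$$
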
Second (see Chapter 5), the capacity (as discussed in Section 1.3) of the class of sep- 
arating hyperplanes decreases with increasing margin. Hence there are theoretical 
arguments supporting the good generalization performance of the optimal hyper- 
plane, cf. Chapters 5, 7, 12. In addition, it is computationally attractive, since we 
will show below that it can be constructed by solving a quadratic programming 
problem for which efficient algorithms exist (see Chapters 6 and 10). 

Note that the form of the decision function (1.22) is quite similar to our earlier 
example (1.9). The ways in which the classifiers are trained, however, are different. 
In the earlier example, the normal vector of the hyperplane was trivially computed 
from the class means as $w = c_+ - c_-$.
Figure 1.5 A binary classification toy problem: separate balls from diamonds. The optimal
hyperplane (1.23) is shown as a solid line. The problem being separable, there exists a weight
vector w and a threshold b such that $y_i(\langle w, x_i \rangle + b) > 0$ ($i = 1, \ldots, m$). Rescaling w and
b such that the point(s) closest to the hyperplane satisfy $|\langle w, x_i \rangle + b| = 1$, we obtain a
canonical form (w, b) of the hyperplane, satisfying $y_i(\langle w, x_i \rangle + b) \ge 1$. Note that in this
case, the margin (the distance of the closest point to the hyperplane) equals $1/\|w\|$. This
can be seen by considering two points $x_1, x_2$ on opposite sides of the margin, that is,
$\langle w, x_1 \rangle + b = 1$, $\langle w, x_2 \rangle + b = -1$, and projecting them onto the hyperplane normal vector
$w/\|w\|$.


In the present case, we need to do some additional work to find the normal 
vector that leads to the largest margin. To construct the optimal hyperplane, we 
have to solve 


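$$\underset{w \in \mathcal{H},\, b \in \mathbb{R}}{\text{minimize}} \ \tau(w) = \frac{1}{2}\|w\|^2 \qquad (1.24)$$
$$\text{subject to } y_i(\langle w, x_i \rangle + b) \ge 1 \ \text{ for all } i = 1, \ldots, m. \qquad (1.25)$$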


Note that the constraints (1.25) ensure that $f(x_i)$ will be $+1$ for $y_i = +1$, and $-1$
for $y_i = -1$. Now one might argue that for this to be the case, we don’t actually
need the “$\ge 1$” on the right hand side of (1.25). However, without it, it would
not be meaningful to minimize the length of w: to see this, imagine we wrote
“$> 0$” instead of “$\ge 1$.” Now assume that the solution is (w, b). Let us rescale this
solution by multiplication with some $0 < \lambda < 1$. Since $\lambda > 0$, the constraints are
still satisfied. Since $\lambda < 1$, however, the length of w has decreased. Hence (w, b)
cannot be the minimizer of $\tau(w)$.

The “$\ge 1$” on the right hand side of the constraints effectively fixes the scaling
of w. In fact, any other positive number would do.

Let us now try to get an intuition for why we should be minimizing the length 
of w, as in (1.24). If ||w|| were 1, then the left hand side of (1.25) would equal 
the distance from $x_i$ to the hyperplane (cf. (1.23)). In general, we have to divide
$y_i(\langle w, x_i \rangle + b)$ by $\|w\|$ to transform it into this distance. Hence, if we can satisfy
(1.25) for all $i = 1, \ldots, m$ with a w of minimal length, then the overall margin will
be maximized.

A more detailed explanation of why this leads to the maximum margin hyper- 
plane will be given in Chapter 7. A short summary of the argument is also given 
in Figure 1.5. 
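
The relation between the canonical form and the margin can be verified numerically. The sketch below takes a separating hyperplane that is not yet canonical, rescales it so that the closest point satisfies $|\langle w, x_i \rangle + b| = 1$, and compares the geometric margin with $1/\|w\|$. The toy points and helper names are assumptions made for this illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def canonical_form(w, b, points):
    """Rescale (w, b) so that the closest point satisfies |<w, x> + b| = 1."""
    scale = min(abs(dot(w, x) + b) for x in points)
    if scale == 0:
        raise ValueError("a point lies on the hyperplane")
    return [wi / scale for wi in w], b / scale


def margin(w, b, points):
    """Distance from the hyperplane to the closest point."""
    return min(abs(dot(w, x) + b) for x in points) / math.sqrt(dot(w, w))


points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 2.0)]
w, b = [5.0, 0.0], 0.0  # separates the points, but is not in canonical form
w_can, b_can = canonical_form(w, b, points)
geometric_margin = margin(w, b, points)
inverse_norm = 1.0 / math.sqrt(dot(w_can, w_can))
```

Rescaling does not move the hyperplane, so the geometric margin is unchanged; only in canonical form does it coincide with $1/\|w\|$, which is why minimizing $\|w\|$ under (1.25) maximizes the margin.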

The function $\tau$ in (1.24) is called the objective function, while (1.25) are called inequality
constraints. Together, they form a so-called constrained optimization problem.
Problems of this kind are dealt with by introducing Lagrange multipliers $\alpha_i \ge 0$ and
a Lagrangian⁹


$$L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i(\langle x_i, w \rangle + b) - 1 \right). \qquad (1.26)$$


The Lagrangian L has to be minimized with respect to the primal variables w and b 
and maximized with respect to the dual variables a; (in other words, a saddle point 
has to be found). Note that the constraint has been incorporated into the second 
term of the Lagrangian; it is not necessary to enforce it explicitly. 

Let us try to get some intuition for this way of dealing with constrained optimization
problems. If a constraint (1.25) is violated, then $y_i(\langle w, x_i \rangle + b) - 1 < 0$,
in which case L can be increased by increasing the corresponding $\alpha_i$. At the
same time, w and b will have to change such that L decreases. To prevent
$\alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right)$ from becoming an arbitrarily large negative number, the
change in w and b will ensure that, provided the problem is separable, the
constraint will eventually be satisfied. Similarly, one can understand that for
all constraints which are not precisely met as equalities (that is, for which
$y_i(\langle w, x_i \rangle + b) - 1 > 0$), the corresponding $\alpha_i$ must be 0: this is the value of $\alpha_i$
that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker (KKT)
complementarity conditions of optimization theory (Chapter 6).

The statement that at the saddle point, the derivatives of L with respect to the 
primal variables must vanish, 


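$$\frac{\partial}{\partial b} L(w, b, \boldsymbol{\alpha}) = 0 \quad \text{and} \quad \frac{\partial}{\partial w} L(w, b, \boldsymbol{\alpha}) = 0, \qquad (1.27)$$

leads to

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (1.28)$$

and

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i. \qquad (1.29)$$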


9. Henceforth, we use boldface Greek letters as a shorthand for corresponding vectors 
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$.
The solution vector thus has an expansion (1.29) in terms of a subset of the training
patterns, namely those patterns with non-zero $\alpha_i$, called Support Vectors (SVs) (cf.
(1.15) in the initial example). By the KKT conditions,


$$\alpha_i \left[ y_i(\langle x_i, w \rangle + b) - 1 \right] = 0 \ \text{ for all } i = 1, \ldots, m, \qquad (1.30)$$


the SVs lie on the margin (cf. Figure 1.5). All remaining training examples $(x_j, y_j)$
are irrelevant: their constraint $y_j(\langle w, x_j \rangle + b) \ge 1$ (cf. (1.25)) could just as well
be left out, and they do not appear in the expansion (1.29). This nicely captures 
our intuition of the problem: as the hyperplane (cf. Figure 1.5) is completely 
determined by the patterns closest to it, the solution should not depend on the 
other examples. 

By substituting (1.28) and (1.29) into the Lagrangian (1.26), one eliminates the 
primal variables w and b, arriving at the so-called dual optimization problem, which 
is the problem that one usually solves in practice: 


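$$\underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{maximize}} \ W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \qquad (1.31)$$
$$\text{subject to } \alpha_i \ge 0 \ \text{ for all } i = 1, \ldots, m \ \text{ and } \ \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (1.32)$$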


Using (1.29), the hyperplane decision function (1.22) can thus be written as 


$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i \langle x, x_i \rangle + b \right), \qquad (1.33)$$
where b is computed by exploiting (1.30) (for details, cf. Chapter 7). 
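
The dual problem can be worked through by hand for two training points, which makes a convenient sanity check. In the sketch below (toy data and helper names are assumptions), the equality constraint in (1.32) forces $\alpha_1 = \alpha_2$, so the dual objective (1.31) becomes a function of a single scalar that is maximized here by a plain grid scan rather than a quadratic programming solver. The weight vector then follows from (1.29) and the threshold from (1.30).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys):
    """W(alpha) from (1.31)."""
    m = len(xs)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(m) for j in range(m))
    return sum(alpha) - 0.5 * quad


def weight_vector(alpha, xs, ys):
    """w = sum_i alpha_i y_i x_i, the expansion (1.29)."""
    return [sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs))
            for d in range(len(xs[0]))]


def threshold(alpha, xs, ys, w, tol=1e-9):
    """b from the KKT condition (1.30), averaged over the support vectors."""
    sv = [i for i, a in enumerate(alpha) if a > tol]
    return sum(ys[i] - dot(w, xs[i]) for i in sv) / len(sv)


xs, ys = [(1.0, 0.0), (-1.0, 0.0)], [1, -1]
# With two points of opposite class, (1.32) forces alpha_1 = alpha_2 = a.
grid = [k / 1000 for k in range(0, 2001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], xs, ys))
alpha = [best_a, best_a]
w = weight_vector(alpha, xs, ys)
b = threshold(alpha, xs, ys, w)
```

Here $W(a) = 2a - 2a^2$, which peaks at $a = 1/2$; the resulting hyperplane has $w = (1, 0)$ and $b = 0$, and both points lie exactly on the margin, as the KKT conditions require for support vectors.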

The structure of the optimization problem closely resembles those that typically 
arise in Lagrange’s formulation of mechanics (e.g., [206]). In the latter class of 
problem, it is also often the case that only a subset of constraints become active. 
For instance, if we keep a ball in a box, then it will typically roll into one of the 
corners. The constraints corresponding to the walls which are not touched by the 
ball are irrelevant, and those walls could just as well be removed. 

Seen in this light, it is not too surprising that it is possible to give a mechanical 
interpretation of optimal margin hyperplanes [87]: If we assume that each SV $x_i$
exerts a perpendicular force of size $\alpha_i$ and direction $y_i \cdot w/\|w\|$ on a solid plane
sheet lying along the hyperplane, then the solution satisfies the requirements for
mechanical stability. The constraint (1.28) states that the forces on the sheet sum to
zero, and (1.29) implies that the torques also sum to zero, via
$\sum_i x_i \times y_i \alpha_i w/\|w\| = w \times w/\|w\| = 0$.¹⁰ This mechanical analogy illustrates the physical meaning of the
term Support Vector.


10. Here, the $\times$ denotes the vector (or cross) product, satisfying $v \times v = 0$ for all $v \in \mathcal{H}$.


Figure 1.6 The idea of SVMs: map the training data into a higher-dimensional feature
space via $\Phi$, and construct a separating hyperplane with maximum margin there. This
yields a nonlinear decision boundary in input space. By the use of a kernel function (1.2), it
is possible to compute the separating hyperplane without explicitly carrying out the map
into the feature space.


1.5 Support Vector Classification 




We now have all the tools to describe SVMs (Figure 1.6). Everything in the last 
section was formulated in a dot product space. We think of this space as the feature 
space H of Section 1.1. To express the formulas in terms of the input patterns in X, 
we thus need to employ (1.6), which expresses the dot product of bold face feature 
vectors $\mathbf{x}, \mathbf{x}'$ in terms of the kernel k evaluated on input patterns $x, x'$,


$$k(x, x') = \langle \mathbf{x}, \mathbf{x}' \rangle. \qquad (1.34)$$


This substitution, which is sometimes referred to as the kernel trick, was used by 
Boser, Guyon, and Vapnik [62] to extend the Generalized Portrait hyperplane clas- 
sifier to nonlinear Support Vector Machines. Aizerman, Braverman, and Rozonoér 
[4] called H the linearization space, and used it in the context of the potential func- 
tion classification method to express the dot product between elements of H in 
terms of elements of the input space. 
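
The kernel trick can be checked on a case where the feature map is small enough to write out. The sketch below uses the homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$ together with one standard choice of explicit feature map, $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; this particular map and the function names are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, evaluated in input space."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map whose dot product reproduces poly2_kernel on R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly2_kernel(x, z)
via_features = dot(poly2_features(x), poly2_features(z))
```

Both routes give the same number, but the kernel never forms the feature vectors. For higher degrees or input dimensions the explicit map grows quickly, while the cost of the kernel evaluation stays that of one dot product in input space.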

The kernel trick can be applied since all feature vectors only occurred in dot 
products (see (1.31) and (1.33)). The weight vector (cf. (1.29)) then becomes an 
expansion in feature space, and therefore will typically no longer correspond to 
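the $\Phi$-image of a single input space vector (cf. Chapter 18). We obtain decision
functions of the form (cf. (1.33))


$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i \langle \Phi(x), \Phi(x_i) \rangle + b \right) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b \right), \qquad (1.35)$$

and the following quadratic program (cf. (1.31)):

$$\underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{maximize}} \ W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \qquad (1.36)$$
$$\text{subject to } \alpha_i \ge 0 \ \text{ for all } i = 1, \ldots, m, \ \text{ and } \ \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (1.37)$$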
Figure 1.7 Example of an SV classifier found using a radial basis function kernel
$k(x, x') = \exp(-\|x - x'\|^2)$ (here, the input space is $\mathcal{X} = [-1, 1]^2$). Circles and disks are two classes of
training examples; the middle line is the decision surface; the outer lines precisely meet the
constraint (1.25). Note that the SVs found by the algorithm (marked by extra circles) are not
centers of clusters, but examples which are critical for the given classification task. Gray
values code $\left| \sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b \right|$, the modulus of the argument of the decision function
(1.35). The top and the bottom lines indicate places where it takes the value 1 (from [471]).


Figure 1.7 shows an example of this approach, using a Gaussian radial basis 
function kernel. We will later study the different possibilities for the kernel func- 
tion in detail (Chapters 2 and 13). 

In practice, a separating hyperplane may not exist, e.g., if a high noise level 
causes a large overlap of the classes. To allow for the possibility of examples 
violating (1.25), one introduces slack variables [111, 561, 481]


$$\xi_i \ge 0 \ \text{ for all } i = 1, \ldots, m, \qquad (1.38)$$

in order to relax the constraints (1.25) to

$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \ \text{ for all } i = 1, \ldots, m. \qquad (1.39)$$


A classifier that generalizes well is then found by controlling both the classifier
capacity (via $\|w\|$) and the sum of the slacks $\sum_i \xi_i$. The latter can be shown to
provide an upper bound on the number of training errors.

One possible realization of such a soft margin classifier is obtained by minimizing 
the objective function 


$$\tau(w, \boldsymbol{\xi}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (1.40)$$

subject to the constraints (1.38) and (1.39), where the constant $C > 0$ determines
the trade-off between margin maximization and training error minimization.¹¹
Incorporating a kernel, and rewriting it in terms of Lagrange multipliers, this again 
leads to the problem of maximizing (1.36), subject to the constraints

$$0 \le \alpha_i \le C \ \text{ for all } i = 1, \ldots, m, \ \text{ and } \ \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (1.41)$$


The only difference from the separable case is the upper bound C on the Lagrange
multipliers $\alpha_i$. This way, the influence of the individual patterns (which could be
outliers) gets limited. As above, the solution takes the form (1.35). The threshold
b can be computed by exploiting the fact that for all SVs $x_i$ with $\alpha_i < C$, the slack
variable $\xi_i$ is zero (this again follows from the KKT conditions), and hence


$$\sum_{j=1}^{m} y_j \alpha_j k(x_i, x_j) + b = y_i. \qquad (1.42)$$

Geometrically speaking, choosing b amounts to shifting the hyperplane, and (1.42)
states that we have to shift the hyperplane such that the SVs with zero slack
variables lie on the $\pm 1$ lines of Figure 1.5.
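
Given a solution of the dual, (1.42) and (1.35) translate directly into code. The sketch below assumes the multipliers are already known (here taken from a hypothetical two-point problem with a linear kernel, for which $\alpha = (1/2, 1/2)$) and averages the values of b implied by (1.42) over all SVs strictly inside the box $(0, C)$; averaging is a common numerical convenience, not something the text prescribes.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def expansion(x, xs, ys, alpha, kernel):
    """sum_j y_j alpha_j k(x, x_j), the kernel expansion inside (1.35)."""
    return sum(y * a * kernel(x, xj) for xj, y, a in zip(xs, ys, alpha))


def threshold(xs, ys, alpha, C, kernel, tol=1e-9):
    """Average of the b values implied by (1.42) over SVs with 0 < alpha_i < C."""
    free = [i for i, a in enumerate(alpha) if tol < a < C - tol]
    if not free:
        raise ValueError("no support vector strictly inside the box (0, C)")
    return sum(ys[i] - expansion(xs[i], xs, ys, alpha, kernel)
               for i in free) / len(free)


def decide(x, xs, ys, alpha, b, kernel):
    """Decision function (1.35); ties at zero are assigned to class +1."""
    return 1 if expansion(x, xs, ys, alpha, kernel) + b >= 0 else -1


# Hypothetical solved problem: two points, alpha = (1/2, 1/2), box bound C = 10.
xs, ys, alpha, C = [(1.0, 0.0), (-1.0, 0.0)], [1, -1], [0.5, 0.5], 10.0
b = threshold(xs, ys, alpha, C, linear_kernel)
labels = [decide(x, xs, ys, alpha, b, linear_kernel)
          for x in [(0.3, 2.0), (-4.0, 1.0)]]
```

Swapping `linear_kernel` for any other kernel function leaves the code unchanged, which mirrors the fact that the patterns enter (1.35) and (1.42) only through kernel evaluations.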

Another possible realization of a soft margin variant of the optimal hyperplane
uses the more natural $\nu$-parametrization. In it, the parameter C is replaced by a
parameter $\nu \in (0, 1]$ which can be shown to provide lower and upper bounds
for the fraction of examples that will be SVs and those that will have non-zero
slack variables, respectively. It uses a primal objective function with the error term
$\left( \frac{1}{\nu m} \sum_i \xi_i \right) - \rho$ instead of $C \sum_i \xi_i$ (cf. (1.40)), and separation constraints that involve
a margin parameter $\rho$,


$$y_i(\langle w, x_i \rangle + b) \ge \rho - \xi_i \ \text{ for all } i = 1, \ldots, m, \qquad (1.43)$$


which itself is a variable of the optimization problem. The dual can be shown
to consist in maximizing the quadratic part of (1.36), subject to $0 \le \alpha_i \le 1/(\nu m)$,
$\sum_i \alpha_i y_i = 0$ and the additional constraint $\sum_i \alpha_i = 1$. We shall return to these methods
in more detail in Section 7.5.


1.6 Support Vector Regression 


Let us turn to a problem slightly more general than pattern recognition. Rather 
than dealing with outputs $y \in \{\pm 1\}$, regression estimation is concerned with estimating
real-valued functions.

To generalize the SV algorithm to the regression case, an analog of the soft 
margin is constructed in the space of the target values y (note that we now have 


11. It is sometimes convenient to scale the sum in (1.40) by C/m rather than C, as done in 
Chapter 7 below. 
Figure 1.8 In SV regression, a tube with radius $\varepsilon$ is fitted to the data. The trade-off between
model complexity and points lying outside of the tube (with positive slack variables $\xi$) is
determined by minimizing (1.47).


$y \in \mathbb{R}$) by using Vapnik’s $\varepsilon$-insensitive loss function [561] (Figure 1.8, see Chapters 3
and 9). This quantifies the loss incurred by predicting f(x) instead of y as


$$c(x, y, f(x)) := |y - f(x)|_\varepsilon := \max\{0, |y - f(x)| - \varepsilon\}. \qquad (1.44)$$

To estimate a linear regression

$$f(x) = \langle w, x \rangle + b, \qquad (1.45)$$


one minimizes

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon. \qquad (1.46)$$


Note that the term $\|w\|^2$ is the same as in pattern recognition (cf. (1.40)); for further
details, cf. Chapter 9.
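
The $\varepsilon$-insensitive loss (1.44) and the objective (1.46) are easy to evaluate for a fixed linear model. The following sketch does so on made-up one-dimensional data lying close to the line y = 2x; the data, the parameter values, and the function names are illustrative assumptions.

```python
def eps_insensitive(y, prediction, eps):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps), as in (1.44)."""
    return max(0.0, abs(y - prediction) - eps)


def regularized_risk(w, b, data, C, eps):
    """Objective (1.46) for the linear model f(x) = <w, x> + b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    penalty = sum(eps_insensitive(y, f(x), eps) for x, y in data)
    return 0.5 * sum(wi * wi for wi in w) + C * penalty


# Hypothetical 1-D data lying near the line y = 2x.
data = [((0.0,), 0.05), ((1.0,), 2.0), ((2.0,), 4.5), ((3.0,), 5.95)]
losses = [eps_insensitive(y, 2.0 * x[0], eps=0.1) for x, y in data]
objective = regularized_risk(w=[2.0], b=0.0, data=data, C=1.0, eps=0.1)
```

Only the third point, which misses the line by 0.5, contributes to the loss; the deviations of the other points are within the tube of radius $\varepsilon = 0.1$ and cost nothing.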

We can transform this into a constrained optimization problem by introducing 
slack variables, akin to the soft margin case. In the present case, we need two types 
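of slack variable for the two cases $f(x_i) - y_i > \varepsilon$ and $y_i - f(x_i) > \varepsilon$. We denote them
by $\xi$ and $\xi^*$, respectively, and collectively refer to them as $\xi^{(*)}$.

The optimization problem is given by


$$\underset{w \in \mathcal{H},\, \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\, b \in \mathbb{R}}{\text{minimize}} \ \tau(w, \boldsymbol{\xi}^{(*)}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \qquad (1.47)$$
$$\text{subject to } f(x_i) - y_i \le \varepsilon + \xi_i, \qquad (1.48)$$
$$y_i - f(x_i) \le \varepsilon + \xi_i^*, \qquad (1.49)$$
$$\xi_i, \xi_i^* \ge 0 \ \text{ for all } i = 1, \ldots, m. \qquad (1.50)$$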

Note that according to (1.48) and (1.49), any error smaller than $\varepsilon$ does not require
a nonzero $\xi_i$ or $\xi_i^*$ and hence does not enter the objective function (1.47).
Generalization to kernel-based regression estimation is carried out in an analogous
manner to the case of pattern recognition. Introducing Lagrange multipliers,
one arrives at the following optimization problem (for $C, \varepsilon \ge 0$ chosen a priori):


$$\underset{\boldsymbol{\alpha}, \boldsymbol{\alpha}^* \in \mathbb{R}^m}{\text{maximize}} \ W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = -\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j) \qquad (1.51)$$
$$\text{subject to } 0 \le \alpha_i, \alpha_i^* \le C \ \text{ for all } i = 1, \ldots, m, \ \text{ and } \ \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0. \qquad (1.52)$$

The regression estimate takes the form 
$$f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) k(x_i, x) + b, \qquad (1.53)$$


where b is computed using the fact that (1.48) becomes an equality with $\xi_i = 0$ if
$0 < \alpha_i < C$, and (1.49) becomes an equality with $\xi_i^* = 0$ if $0 < \alpha_i^* < C$ (for details,
see Chapter 9). The solution thus looks quite similar to the pattern recognition case
(cf. (1.35) and Figure 1.9).

A number of extensions of this algorithm are possible. From an abstract point of 
view, we just need some target function which depends on $(w, \boldsymbol{\xi})$ (cf. (1.47)). There
are multiple degrees of freedom for constructing it, including some freedom in how
to penalize, or regularize. For instance, more general loss functions can be used for
$\boldsymbol{\xi}$, leading to problems that can still be solved efficiently ([512, 515], cf. Chapter 9).
Moreover, norms other than the 2-norm $\|\cdot\|$ can be used to regularize the solution
(see Sections 4.9 and 9.4).

Finally, the algorithm can be modified such that $\varepsilon$ need not be specified a priori.
Instead, one specifies an upper bound $0 \le \nu \le 1$ on the fraction of points allowed
to lie outside the tube (asymptotically, the number of SVs) and the corresponding $\varepsilon$
is computed automatically. This is achieved by using as primal objective function


$$\frac{1}{2}\|w\|^2 + C \left( \nu m \varepsilon + \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon \right) \qquad (1.54)$$


instead of (1.46), and treating $\varepsilon \ge 0$ as a parameter over which we minimize. For
more detail, cf. Section 9.3.


1.7 Kernel Principal Component Analysis 


The kernel method for computing dot products in feature spaces is not restricted 
to SVMs. Indeed, it has been pointed out that it can be used to develop nonlinear 
generalizations of any algorithm that can be cast in terms of dot products, such as 
principal component analysis (PCA) [480]. 

Principal component analysis is perhaps the most common feature extraction 
algorithm; for details, see Chapter 14. The term feature extraction commonly refers 
to procedures for extracting (real) numbers from patterns which in some sense
represent the crucial information contained in these patterns. 

PCA in feature space leads to an algorithm called kernel PCA. By solving an
eigenvalue problem, the algorithm computes nonlinear feature extraction functions


$$f_n(x) = \sum_{i=1}^{m} \alpha_i^n k(x_i, x), \qquad (1.55)$$


where, up to a normalizing constant, the $\alpha_i^n$ are the components of the nth eigenvector
of the kernel matrix $K_{ij} := (k(x_i, x_j))$.

In a nutshell, this can be understood as follows. To do PCA in $\mathcal{H}$, we wish to
find eigenvectors v and eigenvalues $\lambda$ of the so-called covariance matrix C in the
feature space, where


$$C := \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^\top. \qquad (1.56)$$

Here, $\Phi(x_j)^\top$ denotes the transpose of $\Phi(x_j)$ (see Section B.2.1). In the case when
$\mathcal{H}$ is very high dimensional, the computational costs of doing this directly are
prohibitive. Fortunately, one can show that all solutions to


$$C v = \lambda v \qquad (1.57)$$


with $\lambda \neq 0$ must lie in the span of $\Phi$-images of the training data. Thus, we may
expand the solution v as


$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i), \qquad (1.58)$$

thereby reducing the problem to that of finding the $\alpha_i$. It turns out that this leads
to a dual eigenvalue problem for the expansion coefficients,


$$m \lambda \boldsymbol{\alpha} = K \boldsymbol{\alpha}, \qquad (1.59)$$


where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^\top$.
To extract nonlinear features from a test point x, we compute the dot product
between $\Phi(x)$ and the nth normalized eigenvector in feature space,


$$\langle v^n, \Phi(x) \rangle = \sum_{i=1}^{m} \alpha_i^n k(x_i, x). \qquad (1.60)$$


Usually, this will be computationally far less expensive than taking the dot product 
in the feature space explicitly. 
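
The steps (1.56) to (1.60) can be traced on a tiny example. The sketch below is a simplified illustration under several assumptions: it extracts only the first component, it finds the leading eigenvector of K by plain power iteration instead of a full eigensolver, and it assumes the mapped data have zero mean in feature space (true for the toy data with the linear kernel used here). All function names are hypothetical.

```python
import math


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def top_eigenpair(K, iterations=200):
    """Power iteration for the leading eigenpair of a symmetric PSD matrix K."""
    m = len(K)
    v = [1.0 / (i + 1) for i in range(m)]  # arbitrary start vector
    for _ in range(iterations):
        Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(c * c for c in Kv))
        v = [c / norm for c in Kv]
    Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(a * b for a, b in zip(v, Kv))  # Rayleigh quotient, ||v|| = 1
    return eigenvalue, v


def kernel_pca_first_component(xs, kernel):
    """Return the feature extractor (1.60) for the leading eigenvector."""
    K = [[kernel(a, b) for b in xs] for a in xs]
    eigenvalue, unit = top_eigenpair(K)
    # Scale alpha so that v = sum_i alpha_i Phi(x_i) has unit length,
    # i.e. alpha^T K alpha = 1.
    alpha = [c / math.sqrt(eigenvalue) for c in unit]
    return lambda x: sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


xs = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]  # toy data with zero mean
feature = kernel_pca_first_component(xs, linear_kernel)
value = feature((1.0, 0.0))
```

With the linear kernel, kernel PCA reduces to ordinary PCA: the leading principal direction of these points is the first coordinate axis, so projecting the test point (1, 0) gives a feature of magnitude 1. Replacing `linear_kernel` by a nonlinear kernel yields nonlinear features from the same code, provided the centering assumption is addressed.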

A toy example is given in Chapter 14 (Figure 14.4). As in the case of SVMs, the 
architecture can be visualized by Figure 1.9. 
Figure 1.9 Architecture of SVMs and related kernel methods. The input x and the expansion
patterns (SVs) $x_i$ (we assume that we are dealing with handwritten digits) are nonlinearly
mapped (by $\Phi$) into a feature space where dot products are computed. Through
the use of the kernel k, these two layers are in practice computed in one step. The results
are linearly combined using weights $\upsilon_i$, found by solving a quadratic program (in pattern
recognition, $\upsilon_i = y_i \alpha_i$; in regression estimation, $\upsilon_i = \alpha_i^* - \alpha_i$) or an eigenvalue problem
(Kernel PCA). The linear combination is fed into the function $\sigma$ (in pattern recognition,
$\sigma(x) = \operatorname{sgn}(x + b)$; in regression estimation, $\sigma(x) = x + b$; in Kernel PCA, $\sigma(x) = x$).


1.8 Empirical Results and Implementations 




Having described the basics of SVMs, we now summarize some empirical find- 
ings. By the use of kernels, the optimal margin classifier was turned into a high- 
performance classifier. Surprisingly, it was observed that the polynomial kernel 


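$$k(x, x') = \langle x, x' \rangle^d, \qquad (1.61)$$

the Gaussian

$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \sigma^2} \right), \qquad (1.62)$$

and the sigmoid

$$k(x, x') = \tanh(\kappa \langle x, x' \rangle + \Theta), \qquad (1.63)$$


with suitable choices of $d \in \mathbb{N}$ and $\sigma, \kappa, \Theta \in \mathbb{R}$ (here, $\mathcal{X} \subset \mathbb{R}^N$), empirically led to
SV classifiers with very similar accuracies and SV sets (Section 7.8.2). In this sense,
the SV set seems to characterize (or compress) the given task in a manner which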
to some extent is independent of the type of kernel (that is, the type of classifier)
used, provided the kernel parameters are well adjusted. 
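
The three kernels (1.61) to (1.63) can be written down in a few lines. The sketch below also builds the kernel (Gram) matrix on a handful of points; the parameter values and the sample points are arbitrary choices made for illustration, not values recommended by the text.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d):
    """Polynomial kernel (1.61)."""
    return dot(x, z) ** d


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel (1.62)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, z, kappa, theta):
    """Sigmoid kernel (1.63)."""
    return math.tanh(kappa * dot(x, z) + theta)


def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values K_ij = k(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]


points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K_poly = gram_matrix(points, lambda a, b: polynomial_kernel(a, b, d=2))
K_rbf = gram_matrix(points, lambda a, b: gaussian_kernel(a, b, sigma=1.0))
K_sig = gram_matrix(points, lambda a, b: sigmoid_kernel(a, b, kappa=0.5, theta=-1.0))
```

Each of these functions can be passed wherever a kernel k appears in the dual problem (1.36) or the decision function (1.35); only the Gram matrix changes, not the optimization procedure.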

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), 
a problem where the two main issues are classification accuracy and classification 
speed. Consequently, some effort went into the improvement of SVMs on these 
issues, leading to the Virtual SV method for incorporating prior knowledge about 
transformation invariances by transforming SVs (Chapter 7), and the Reduced Set 
method (Chapter 18) for speeding up classification. Using these procedures, SVMs 
soon became competitive with the best available classifiers on OCR and other 
object recognition tasks [87, 57, 419, 438, 134], and later even achieved the world 
record on the main handwritten digit benchmark dataset [134]. 

An initial weakness of SVMs, less apparent in OCR applications which are 
characterized by low noise levels, was that the size of the quadratic programming 
problem (Chapter 10) scaled with the number of support vectors. This was due to 
the fact that in (1.36), the quadratic part contained at least all SVs — the common 
practice was to extract the SVs by going through the training data in chunks 
while regularly testing for the possibility that patterns initially not identified as 
SVs become SVs at a later stage. This procedure is referred to as chunking; note 
that without chunking, the size of the matrix in the quadratic part of the objective 
function would be m x m, where m is the number of all training examples. 

What happens if we have a high-noise problem? In this case, many of the 
slack variables $\xi_i$ become nonzero, and all the corresponding examples become
SVs. For this case, decomposition algorithms were proposed [398, 409], based
on the observation that not only can we leave out the non-SV examples (the $x_i$
with $\alpha_i = 0$) from the current chunk, but also some of the SVs, especially those
that hit the upper boundary ($\alpha_i = C$). The chunks are usually dealt with using
quadratic optimizers. Among the optimizers used for SVMs are LOQO [555], 
MINOS [380], and variants of conjugate gradient descent, such as the optimizers of 
Bottou [459] and Burges [85]. Several public domain SV packages and optimizers 
are listed on the web page http://www.kernel-machines.org. For more details on 
implementations, see Chapter 10. 

Once the SV algorithm had been generalized to regression, researchers started 
applying it to various problems of estimating real-valued functions. Very good 
results were obtained on the Boston housing benchmark [529], and on problems of 
time series prediction (see [376, 371, 351]). Moreover, the SV method was applied
to the solution of inverse function estimation problems ([572]; cf. [563, 589]). For 
overviews, the interested reader is referred to [85, 472, 504, 125]. 


I CONCEPTS AND TOOLS 


The generic can be more intense than the concrete. 
J. L. Borges¹


We now embark on a more systematic presentation of the concepts and tools 
underlying Support Vector Machines and other kernel methods. 

In machine learning problems, we try to discover structure in data. For in- 
stance, in pattern recognition and regression estimation, we are given a training 
set $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathcal{Y}$, and attempt to predict the outputs y for previously
unseen inputs x. This is only possible if we have some measure that tells us
how (x, y) is related to the training set. Informally, we want similar inputs to lead
to similar outputs.² To formalize this, we have to state what we mean by similar.

A particularly simple yet surprisingly useful notion of similarity of inputs — the 
one we will use throughout this book — derives from embedding the data into 
a Euclidean feature space and utilizing geometrical concepts. Chapter 2 describes 
how certain classes of kernels induce feature spaces, and how one can compute 
dot products, and thus angles and distances, without having to explicitly work in 
these potentially infinite-dimensional spaces. This leads to a rather general class
of similarity measures to be used on the inputs.


1. From A History of Eternity, in The Total Library, Penguin, London, 2001. 

2. This procedure can be traced back to an old maxim of law: de similibus ad similia eadem 
ratione procedendum est — from things similar to things similar we are to proceed by the 
same rule. 
On the outputs, similarity is usually measured in terms of a loss function stating
how “bad” it is if the predicted y does not match the true one. The training 
of a learning machine commonly involves a risk functional that contains a term 
measuring the loss incurred for the training patterns. The concepts of loss and risk 
are introduced in depth in Chapter 3. 

This is not the full story, however. In order to generalize well to the test data, 
it is not sufficient to “explain” the training data. It is also necessary to control 
the complexity of the model used for explaining the training data, a task that is 
often accomplished with the help of regularization terms, as explained in Chapter 4. 
Specifically, one utilizes objective functions that involve both the empirical loss 
term and a regularization term. From a statistical point of view, we can expect 
the function minimizing a properly chosen objective function to work well on 
test data, as explained by statistical learning theory (Chapter 5). From a practical 
point of view, however, it is not at all straightforward to find this minimizer. 
Indeed, the quality of a loss function or a regularizer should be assessed not only 
on a statistical basis but also in terms of the feasibility of the objective function 
minimization problem. In order to be able to assess this, and in order to obtain 
a thorough understanding of practical algorithms for this task, we conclude this 
part of the book with an in-depth review of optimization theory (Chapter 6). 

The chapters in this part of the book assume familiarity with basic concepts 
of linear algebra and probability theory. Readers who would like to refresh their 
knowledge of these topics may want to consult Appendix B beforehand. 


Overview 


Prerequisites 


2 Kernels


In Chapter 1, we described how a kernel arises as a similarity measure that can 
be thought of as a dot product in a so-called feature space. We tried to provide 
an intuitive understanding of kernels by introducing them as similarity measures, 
rather than immediately delving into the functional analytic theory of the classes 
of kernels that actually admit a dot product representation in a feature space. 

In the present chapter, we will be both more formal and more precise. We will
study the class of kernels $k$ that correspond to dot products in feature spaces $\mathcal{H}$ via
a map $\Phi$,

$$\Phi : \mathcal{X} \to \mathcal{H}, \qquad x \mapsto \mathbf{x} := \Phi(x), \tag{2.1}$$

that is,

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle. \tag{2.2}$$


Regarding the input domain X, we need not make assumptions other than it being 
a set. For instance, we could consider a set of discrete objects, such as strings. 

A natural question to ask at this point is what kind of functions k(x, x’) admit a 
representation of the form (2.2); that is, whether we can always construct a dot 
product space $\mathcal{H}$ and a map $\Phi$ mapping into it such that (2.2) holds true. We
shall begin, however, by trying to give some motivation as to why kernels are at 
all useful, considering kernels that compute dot products in spaces of monomial 
features (Section 2.1). Following this, we move on to the question of how, given
a kernel, an associated feature space can be constructed (Section 2.2). This leads to 
the notion of a Reproducing Kernel Hilbert Space, crucial for the theory of kernel 
machines. In Section 2.3, we give some examples and properties of kernels, and in 
Section 2.4, we discuss a class of kernels that can be used as dissimilarity measures 
rather than as similarity measures. 

The chapter builds on knowledge of linear algebra, as briefly summarized in 
Appendix B. Apart from that, it can be read on its own; however, readers new to 
the field will profit from first reading Sections 1.1 and 1.2. 
2.1 Product Features


Monomial 
Features 


In this section, we think of $\mathcal{X}$ as a subset of the vector space $\mathbb{R}^N$ ($N \in \mathbb{N}$), endowed
with the canonical dot product (1.3).

Suppose we are given patterns $x \in \mathcal{X}$ where most information is contained in
the $d$th order products (so-called monomials) of entries $[x]_j$ of $x$,

$$[x]_{j_1} \cdot [x]_{j_2} \cdots [x]_{j_d}, \tag{2.3}$$

where $j_1, \ldots, j_d \in \{1, \ldots, N\}$. Often, these monomials are referred to as product
features. These features form the basis of many practical algorithms; indeed, there 
is a whole field of pattern recognition research studying polynomial classifiers [484], 
which is based on first extracting product features and then applying learning 
algorithms to these features. In other words, the patterns are preprocessed by 
mapping into the feature space H of all products of d entries. This has proven 
quite effective in visual pattern recognition tasks, for instance. To understand the 
rationale for doing this, note that visual patterns are usually represented as vectors 
whose entries are the pixel intensities. Taking products of entries of these vectors 
then corresponds to taking products of pixel intensities, and is thus akin to taking 
logical “and” operations on the pixels. Roughly speaking, this corresponds to the 
intuition that, for instance, a handwritten “8” constitutes an eight if there is a top 
circle and a bottom circle. With just one of the two circles, it is not half an “8,” but 
rather a “0.” Nonlinearities of this type are crucial for achieving high accuracies in 
pattern recognition tasks. 

Let us take a look at this feature map in the simple example of two-dimensional
patterns, for which $\mathcal{X} = \mathbb{R}^2$. In this case, we can collect all monomial feature
extractors of degree 2 in the nonlinear map

$$\Phi : \mathbb{R}^2 \to \mathcal{H} = \mathbb{R}^3, \tag{2.4}$$

$$([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2). \tag{2.5}$$


This approach works fine for small toy examples, but it fails for realistically sized
problems: for $N$-dimensional input patterns, there exist

$$N_{\mathcal{H}} = \binom{d + N - 1}{d} = \frac{(d + N - 1)!}{d!\,(N - 1)!} \tag{2.6}$$

different monomials (2.3) of degree $d$, comprising a feature space $\mathcal{H}$ of dimension
$N_{\mathcal{H}}$. For instance, $16 \times 16$ pixel input images and a monomial degree $d = 5$ thus
yield a dimension of almost $10^{10}$.
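The size of this feature space is easy to check numerically. The following minimal sketch counts the distinct degree-$d$ monomials in $N$ variables as the binomial coefficient $\binom{d+N-1}{d}$; the function name `num_monomials` is our own choice for illustration.

```python
from math import comb


def num_monomials(n_inputs: int, degree: int) -> int:
    """Number of distinct (unordered) monomials of the given degree
    in n_inputs variables: C(degree + n_inputs - 1, degree)."""
    return comb(degree + n_inputs - 1, degree)


# Two-dimensional patterns, degree 2: the three features of the toy example.
toy_dimension = num_monomials(2, 2)

# 16 x 16 pixel images (N = 256) and degree d = 5.
image_dimension = num_monomials(16 * 16, 5)

print(toy_dimension)    # 3
print(image_dimension)  # 9525431552
```

The second value is roughly $0.95 \times 10^{10}$, which is why explicitly storing such feature vectors is impractical.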

In certain cases described below, however, there exists a way of computing dot 
products in these high-dimensional feature spaces without explicitly mapping into 
the spaces, by means of kernels nonlinear in the input space $\mathbb{R}^N$. Thus, if the
subsequent processing can be carried out using dot products exclusively, we are 
able to deal with the high dimension. 

We now describe how dot products in polynomial feature spaces can be computed
efficiently, followed by a section in which we discuss more general feature
spaces. In order to compute dot products of the form $\langle \Phi(x), \Phi(x') \rangle$, we employ
kernel representations of the form

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle, \tag{2.7}$$

which allow us to compute the value of the dot product in $\mathcal{H}$ without having to
explicitly compute the map $\Phi$.

What does $k$ look like in the case of polynomial features? We start by giving an
example for $N = d = 2$, as considered above [561]. For the map

$$\Phi : ([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1) \tag{2.8}$$

(note that for now, we have considered $[x]_1 [x]_2$ and $[x]_2 [x]_1$ as separate features;
thus we are looking at ordered monomials), dot products in $\mathcal{H}$ take the form

$$\langle \Phi(x), \Phi(x') \rangle = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = \langle x, x' \rangle^2. \tag{2.9}$$

In other words, the desired kernel $k$ is simply the square of the dot product in
input space. The same works for arbitrary $N, d \in \mathbb{N}$ [62]: as a straightforward
generalization of a result proved in the context of polynomial approximation [412,
Lemma 2.1], we have:


Proposition 2.1 Define $C_d$ to map $x \in \mathbb{R}^N$ to the vector $C_d(x)$ whose entries are all
possible $d$th degree ordered products of the entries of $x$. Then the corresponding kernel
computing the dot product of vectors mapped by $C_d$ is

$$k(x, x') = \langle C_d(x), C_d(x') \rangle = \langle x, x' \rangle^d. \tag{2.10}$$


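Proof We directly compute

$$\langle C_d(x), C_d(x') \rangle = \sum_{j_1=1}^{N} \cdots \sum_{j_d=1}^{N} [x]_{j_1} \cdots [x]_{j_d} \cdot [x']_{j_1} \cdots [x']_{j_d} \tag{2.11}$$

$$= \sum_{j_1=1}^{N} [x]_{j_1} [x']_{j_1} \cdots \sum_{j_d=1}^{N} [x]_{j_d} [x']_{j_d} = \Big( \sum_{j=1}^{N} [x]_j [x']_j \Big)^d = \langle x, x' \rangle^d. \qquad \blacksquare$$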
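The identity of Proposition 2.1 can be checked numerically on a small example. The sketch below builds all $N^d$ ordered products explicitly and compares their dot product with the $d$th power of the input-space dot product; the helper names `dot` and `ordered_monomials` and the test vectors are our own, chosen for illustration only.

```python
from itertools import product
from math import isclose, prod


def dot(x, y):
    """Canonical dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(x, y))


def ordered_monomials(x, d):
    """All len(x)**d ordered products of d entries of x (the map C_d)."""
    indices = range(len(x))
    return [prod(x[j] for j in idx) for idx in product(indices, repeat=d)]


x = [0.5, -1.0, 2.0]
x_prime = [1.5, 0.25, -0.75]
d = 3

explicit = dot(ordered_monomials(x, d), ordered_monomials(x_prime, d))
via_kernel = dot(x, x_prime) ** d

print(len(ordered_monomials(x, d)))  # 27 features for N = 3, d = 3
print(isclose(explicit, via_kernel))
```

The explicit route touches $N^d$ features, whereas the kernel route needs only one dot product in input space followed by a power.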
Note that we used the symbol $C_d$ for the feature map. The reason for this is that we
would like to reserve $\Phi_d$ for the corresponding map computing unordered product
features. Let us construct such a map $\Phi_d$, yielding the same value of the dot
product. To this end, we have to compensate for the multiple occurrence of certain
monomials in $C_d$ by scaling the respective entries of $\Phi_d$ with the square roots of
their numbers of occurrence. Then, by this construction of $\Phi_d$, and (2.10),

$$\langle \Phi_d(x), \Phi_d(x') \rangle = \langle C_d(x), C_d(x') \rangle = \langle x, x' \rangle^d. \tag{2.12}$$

For instance, if $n$ of the $j_i$ in (2.3) are equal, and the remaining ones are different,
then the coefficient in the corresponding component of $\Phi_d$ is $\sqrt{(d - n + 1)!}$. For the
general case, see Problem 2.2. For $\Phi_2$, this simply means that [561]

$$\Phi_2(x) = ([x]_1^2, [x]_2^2, \sqrt{2}\, [x]_1 [x]_2). \tag{2.13}$$


The above reasoning illustrates an important point pertaining to the construction
of feature spaces associated with kernel functions. Although they map into different
feature spaces, $\Phi_d$ and $C_d$ are both valid instantiations of feature maps for
$k(x, x') = \langle x, x' \rangle^d$.
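For $N = d = 2$, the fact that the ordered and the unordered feature map realize the same kernel can be verified directly. The following sketch is an illustration under our own naming (`c2` for the four ordered products, `phi2` for the three rescaled unordered ones).

```python
from math import isclose, sqrt


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def c2(x):
    """Ordered degree-2 monomials of a 2-D pattern (four features)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, x1 * x2, x2 * x1)


def phi2(x):
    """Unordered degree-2 monomials; the mixed term is scaled by sqrt(2)
    because it occurs twice among the ordered products."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, sqrt(2.0) * x1 * x2)


x = (0.3, -1.2)
x_prime = (2.0, 0.7)

k_ordered = dot(c2(x), c2(x_prime))
k_unordered = dot(phi2(x), phi2(x_prime))
k_direct = dot(x, x_prime) ** 2

print(isclose(k_ordered, k_direct), isclose(k_unordered, k_direct))
```

Both maps land in different spaces (of dimension 4 and 3, respectively), yet yield the same kernel value.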

To illustrate how monomial feature kernels can significantly simplify pattern 
recognition tasks, let us consider a simple toy example. 


Example 2.2 (Monomial Features in 2-D Pattern Recognition) In the example of
Figure 2.1, a non-separable problem is reduced to the construction of a separating
hyperplane by preprocessing the input data with $\Phi_2$. As we shall see in later chapters, this
has advantages both from the computational point of view (there exist efficient
algorithms for computing the hyperplane) and from the statistical point of view (there exist
guarantees for how well the hyperplane will generalize to unseen test points).
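To make the example concrete, the following sketch assumes an axis-aligned ellipse centered at the origin with semi-axes `A` and `B` (hypothetical values, not taken from the figure). The ellipse equation $[x]_1^2/A^2 + [x]_2^2/B^2 = 1$ is linear in the degree-2 monomial features, so membership in the ellipse can be decided by a hyperplane in feature space.

```python
from math import sqrt


def phi2(x):
    """Degree-2 monomial feature map (z1, z2, z3) of a 2-D pattern."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, sqrt(2.0) * x1 * x2)


# Hypothetical semi-axes of the elliptical decision boundary.
A, B = 2.0, 1.0

# Hyperplane <W, z> = OFFSET in feature space; the zero third entry
# means the hyperplane is parallel to the z3 axis.
W = (1.0 / A**2, 1.0 / B**2, 0.0)
OFFSET = 1.0


def inside_ellipse(x):
    """Nonlinear test in input space."""
    return (x[0] / A) ** 2 + (x[1] / B) ** 2 < 1.0


def below_hyperplane(x):
    """Linear test in feature space."""
    z = phi2(x)
    return sum(w * zi for w, zi in zip(W, z)) - OFFSET < 0.0


points = [(0.0, 0.0), (1.5, 0.5), (1.9, 0.9), (-2.5, 0.1), (0.3, -1.2)]
agreement = [inside_ellipse(p) == below_hyperplane(p) for p in points]
print(all(agreement))
```

In a learning setting, `W` and `OFFSET` would of course be unknown and would have to be estimated from the mapped training points.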


In more realistic cases, e.g., if $x$ represents an image with the entries being pixel
values, polynomial kernels $\langle x, x' \rangle^d$ enable us to work in the space spanned by
products of any $d$ pixel values, provided that we are able to do our work solely
in terms of dot products, without any explicit usage of a mapped pattern $\Phi_d(x)$.
Using kernels of the form (2.10), we can take higher-order statistics into account, 
without the combinatorial explosion (2.6) of time and memory complexity which 
accompanies even moderately high N and d. 

To conclude this section, note that it is possible to modify (2.10) such that it maps
into the space of all monomials up to degree $d$, by defining $k(x, x') = (\langle x, x' \rangle + 1)^d$
(Problem 2.17). Moreover, in practice, it is often useful to multiply the kernel by a
scaling factor $c$ to ensure that its numeric range is within some bounded interval,
say $[-1, 1]$. The value of $c$ will depend on the dimension and range of the data.
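As a small illustration of the inhomogeneous polynomial kernel, consider $N = d = 2$. Expanding $(\langle x, x' \rangle + 1)^2$ shows that one possible feature map consists of all monomials up to degree 2 with suitable weights; the map `phi_up_to_2` below is our own explicit choice for this special case, and the scaling factor $c$ is not included.

```python
from math import isclose, sqrt


def inhomogeneous_poly_kernel(x, x_prime, d=2):
    """k(x, x') = (<x, x'> + 1)**d."""
    return (sum(a * b for a, b in zip(x, x_prime)) + 1.0) ** d


def phi_up_to_2(x):
    """One explicit feature map for d = 2 in two dimensions:
    constant, linear, and quadratic monomials."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


x = (0.4, -0.9)
x_prime = (-1.3, 0.2)

via_kernel = inhomogeneous_poly_kernel(x, x_prime)
via_features = sum(a * b for a, b in zip(phi_up_to_2(x), phi_up_to_2(x_prime)))
print(isclose(via_kernel, via_features))
```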
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We
assume that the true decision boundary is an ellipse in input space (left panel). The task
of the learning process is to estimate this boundary based on empirical data consisting of
training points in both classes (crosses and circles, respectively). When mapped into feature
space via the nonlinear map $\Phi_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, \sqrt{2}\, [x]_1 [x]_2)$ (right panel), the
ellipse becomes a hyperplane (in the present simple case, it is parallel to the $z_3$ axis, hence
all points are plotted in the $(z_1, z_2)$ plane). This is due to the fact that ellipses can be written
as linear equations in the entries of $(z_1, z_2, z_3)$. Therefore, in feature space, the problem
reduces to that of estimating a hyperplane from the mapped data points. Note that via the
polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space
can be computed without computing $\Phi_2$. Later in the book, we shall describe algorithms
for constructing hyperplanes which are based on dot products (Chapter 7).


2.2 The Representation of Similarities in Linear Spaces 


In what follows, we will look at things the other way round, and start with the 
kernel rather than with the feature map. Given some kernel, can we construct a 
feature space such that the kernel computes the dot product in that feature space; 
that is, such that (2.2) holds? This question has been brought to the attention 
of the machine learning community in a variety of contexts, especially during 
recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has 
been studied under the heading of Hilbert space representations of kernels. A good 
monograph on the theory of kernels is the book of Berg, Christensen, and Ressel 
[42]; indeed, a large part of the material in the present chapter is based on this 
work. We do not aim to be fully rigorous; instead, we try to provide insight into 
the basic ideas. As a rule, all the results that we state without proof can be found 
in [42]. Other standard references include [16, 455]. 

There is one more aspect in which this section differs from the previous one: 
the latter dealt with vectorial data, and the domain X was assumed to be a subset 
of $\mathbb{R}^N$. By contrast, the results in the current section hold for data drawn from
domains which need no structure, other than their being nonempty sets. This 
generalizes kernel learning algorithms to a large number of situations where a 
vectorial representation is not readily available, and where one directly works 
with pairwise distances or similarities between non-vectorial objects [246, 467,
154, 210, 234, 585]. This theme will recur in several places throughout the book, 
for instance in Chapter 13. 


2.2.1 Positive Definite Kernels 


We start with some basic definitions and results. As in the previous chapter, indices 
i and j are understood to run over 1,..., m. 


Definition 2.3 (Gram Matrix) Given a function $k : \mathcal{X}^2 \to \mathbb{K}$ (where $\mathbb{K} = \mathbb{C}$ or $\mathbb{K} = \mathbb{R}$)
and patterns $x_1, \ldots, x_m \in \mathcal{X}$, the $m \times m$ matrix $K$ with elements

$$K_{ij} := k(x_i, x_j) \tag{2.14}$$

is called the Gram matrix (or kernel matrix) of $k$ with respect to $x_1, \ldots, x_m$.


Definition 2.4 (Positive Definite Matrix) A complex $m \times m$ matrix $K$ satisfying

$$\sum_{i,j} c_i \bar{c}_j K_{ij} \geq 0 \tag{2.15}$$

for all $c_i \in \mathbb{C}$ is called positive definite.¹ Similarly, a real symmetric $m \times m$ matrix $K$
satisfying (2.15) for all $c_i \in \mathbb{R}$ is called positive definite.


Note that a symmetric matrix is positive definite if and only if all its eigenvalues 
are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as 
the quadratic form induced by K. 
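The definitions above translate directly into code. The sketch below builds the Gram matrix of a Gaussian kernel on a few randomly drawn 2-D patterns and evaluates the quadratic form $\sum_{i,j} c_i c_j K_{ij}$ for randomly sampled real coefficient vectors. The choice of kernel, the sample sizes, and the random seed are our own assumptions; sampling coefficients is only a sanity check, not a proof of positive definiteness.

```python
import random
from math import exp


def gaussian_kernel(x, x_prime, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return exp(-sq_dist / (2.0 * sigma**2))


def gram_matrix(k, patterns):
    """K_ij = k(x_i, x_j) for the given patterns."""
    return [[k(xi, xj) for xj in patterns] for xi in patterns]


def quadratic_form(K, c):
    """Sum over i, j of c_i * c_j * K_ij for real coefficients c."""
    m = len(K)
    return sum(c[i] * c[j] * K[i][j] for i in range(m) for j in range(m))


rng = random.Random(0)
patterns = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(gaussian_kernel, patterns)

values = [
    quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in patterns])
    for _ in range(200)
]
min_value = min(values)
print(min_value >= -1e-9)
```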


Definition 2.5 ((Positive Definite) Kernel) Let $\mathcal{X}$ be a nonempty set. A function $k$ on
$\mathcal{X} \times \mathcal{X}$ which for all $m \in \mathbb{N}$ and all $x_1, \ldots, x_m \in \mathcal{X}$ gives rise to a positive definite Gram
matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a
kernel.


Remark 2.6 (Terminology) The term kernel stems from the first use of this type of 
function in the field of integral operators as studied by Hilbert and others [243, 359, 112]. 
A function $k$ which gives rise to an operator $T_k$ via

$$(T_k f)(x) = \int_{\mathcal{X}} k(x, x') f(x') \, dx' \tag{2.16}$$

is called the kernel of $T_k$.

In the literature, a number of different terms are used for positive definite kernels, such 
as reproducing kernel, Mercer kernel, admissible kernel, Support Vector kernel, 
nonnegative definite kernel, and covariance function. One might argue that the term 
positive definite kernel is slightly misleading. In matrix theory, the term definite is 
sometimes reserved for the case where equality in (2.15) only occurs if $c_1 = \cdots = c_m = 0$.


1. The bar in $\bar{c}_j$ denotes complex conjugation; for real numbers, it has no effect.
Simply using the term positive kernel, on the other hand, could be mistaken as referring
to a kernel whose values are positive. Finally, the term positive semidefinite kernel 
becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow 
the convention used for instance in [42], and employ the term positive definite both for 
kernels and matrices in the way introduced above. The case where the value 0 is only 
attained if all coefficients are 0 will be referred to as strictly positive definite. 

We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x’) 
which is not positive definite in the sense stated above, it will be clear from the context. 


The definitions for positive definite kernels and positive definite matrices differ in
the fact that in the former case, we are free to choose the points on which the kernel
is evaluated; for every choice, the kernel induces a positive definite matrix.
Positive definiteness implies positivity on the diagonal (Problem 2.12),

$$k(x, x) \geq 0 \text{ for all } x \in \mathcal{X}, \tag{2.17}$$

and symmetry (Problem 2.13),

$$k(x_i, x_j) = \overline{k(x_j, x_i)}. \tag{2.18}$$

To also cover the complex-valued case, our definition of symmetry includes complex
conjugation. The definition of symmetry of matrices is analogous; that is,
$K_{ij} = \overline{K_{ji}}$.

For real-valued kernels it is not sufficient to stipulate that (2.15) hold for real
coefficients $c_i$. To get away with real coefficients only, we must additionally require
that the kernel be symmetric (Problem 2.14); $k(x_i, x_j) = k(x_j, x_i)$ (cf. Problem 2.13).
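A small counterexample illustrates why symmetry has to be required separately in the real case. The $2 \times 2$ matrix below (our own choice) is real but not symmetric; its quadratic form $\sum_{i,j} c_i \bar{c}_j K_{ij}$ is nonnegative for all real coefficients, since the off-diagonal contributions cancel, yet it takes a non-real value for suitable complex coefficients.

```python
def quadratic_form(K, c):
    """Sum over i, j of c_i * conj(c_j) * K_ij for complex coefficients c."""
    m = len(K)
    return sum(
        c[i] * c[j].conjugate() * K[i][j] for i in range(m) for j in range(m)
    )


# Real, non-symmetric matrix: for real c the form equals c_1^2 + c_2^2.
K = [[1.0, 1.0], [-1.0, 1.0]]

real_value = quadratic_form(K, [complex(2.0), complex(-3.0)])
complex_value = quadratic_form(K, [1 + 0j, 1j])

print(real_value)     # (13+0j)
print(complex_value)  # (2-2j)
```

With complex coefficients the form is not even real-valued here, so this matrix is not positive definite in the sense of the complex definition, although the condition with real coefficients alone would not reveal this.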

It can be shown that whenever k is a (complex-valued) positive definite kernel, 
its real part is a (real-valued) positive definite kernel. Below, we shall largely be 
dealing with real-valued kernels. Most of the results, however, also apply for 
complex-valued kernels. 

Kernels can be regarded as generalized dot products. Indeed, any dot product 
is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard 
property of dot products, does not carry over to general kernels. However, another 
property of dot products, the Cauchy-Schwarz inequality, does have a natural 
generalization to kernels: 


Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If $k$ is a positive definite
kernel, and $x_1, x_2 \in \mathcal{X}$, then

$$|k(x_1, x_2)|^2 \leq k(x_1, x_1) \cdot k(x_2, x_2). \tag{2.19}$$

Proof For the sake of brevity, we give a non-elementary proof using some basic facts
of linear algebra. The $2 \times 2$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$ ($i, j \in \{1, 2\}$)
is positive definite. Hence both its eigenvalues are nonnegative, and so is their
product, the determinant of $K$. Therefore

$$0 \leq K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12} \overline{K_{12}} = K_{11} K_{22} - |K_{12}|^2. \tag{2.20}$$
Figure 2.2 One instantiation of the feature map associated with a kernel is the
map (2.21), which represents each pattern (in the picture, $x$ or $x'$) by a kernel-shaped
function sitting on the pattern. In this sense, each pattern is represented by its
similarity to all other patterns. In the picture, the kernel is assumed to be bell-shaped,
e.g., a Gaussian $k(x, x') = \exp(-\|x - x'\|^2 / (2 \sigma^2))$. In the text, we describe the
construction of a dot product $\langle \cdot, \cdot \rangle$ on the function space such that
$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.


Substituting $k(x_i, x_j)$ for $K_{ij}$, we get the desired inequality. $\blacksquare$
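The Cauchy-Schwarz inequality for kernels, $|k(x_1, x_2)|^2 \leq k(x_1, x_1) \cdot k(x_2, x_2)$, can be spot-checked for concrete positive definite kernels. The sketch below uses a Gaussian kernel and the cubic polynomial kernel $\langle x, x' \rangle^3$ on randomly drawn pairs of points; the kernels, the seed, and the small numerical tolerance are our own choices.

```python
import random
from math import exp


def gaussian_kernel(x, x_prime, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return exp(-sq_dist / (2.0 * sigma**2))


def cubic_kernel(x, x_prime):
    """Homogeneous polynomial kernel of degree 3."""
    return sum(a * b for a, b in zip(x, x_prime)) ** 3


def cauchy_schwarz_holds(k, x1, x2, tol=1e-12):
    """Check |k(x1, x2)|^2 <= k(x1, x1) * k(x2, x2) up to rounding."""
    lhs = abs(k(x1, x2)) ** 2
    rhs = k(x1, x1) * k(x2, x2)
    return lhs <= rhs * (1.0 + tol) + tol


rng = random.Random(1)
pairs = [
    (
        tuple(rng.uniform(-2, 2) for _ in range(3)),
        tuple(rng.uniform(-2, 2) for _ in range(3)),
    )
    for _ in range(100)
]

all_ok = all(
    cauchy_schwarz_holds(k, x1, x2)
    for k in (gaussian_kernel, cubic_kernel)
    for x1, x2 in pairs
)
print(all_ok)
```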


We now show how the feature spaces in question are defined by the choice of 
kernel function. 


2.2.2 The Reproducing Kernel Map 


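Assume that $k$ is a real-valued positive definite kernel, and $\mathcal{X}$ a nonempty set. We
define a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into $\mathbb{R}$, denoted as
$\mathbb{R}^{\mathcal{X}} := \{f : \mathcal{X} \to \mathbb{R}\}$, via

$$\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}, \qquad x \mapsto k(\cdot, x). \tag{2.21}$$

Here, $\Phi(x)$ denotes the function that assigns the value $k(x', x)$ to $x' \in \mathcal{X}$, i.e.,
$\Phi(x)(\cdot) = k(\cdot, x)$ (as shown in Figure 2.2).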
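In programming terms, the map (2.21) turns a pattern into a function, which is naturally expressed as a closure. The following sketch assumes scalar patterns and a Gaussian kernel; the name `reproducing_kernel_map` is our own.

```python
from math import exp, isclose


def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel on scalar patterns."""
    return exp(-((x - x_prime) ** 2) / (2.0 * sigma**2))


def reproducing_kernel_map(k, x):
    """Return the function k(., x): it assigns k(x', x) to every x'."""

    def phi_x(x_prime):
        return k(x_prime, x)

    return phi_x


phi_at_zero = reproducing_kernel_map(gaussian_kernel, 0.0)

# The pattern 0.0 is now represented by a bell-shaped function centered on it.
print(phi_at_zero(0.0))                        # 1.0
print(isclose(phi_at_zero(1.0), exp(-0.5)))    # True
```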

We have thus turned each pattern into a function on the domain X. In this sense, 
a pattern is now represented by its similarity to all other points in the input domain 
X. This seems a very rich representation; nevertheless, it will turn out that the 
kernel allows the computation of the dot product in this representation. Below, 
we show how to construct a feature space associated with $\Phi$, proceeding in the
following steps:
1. Turn the image of $\Phi$ into a vector space,
2. define a dot product; that is, a strictly positive definite bilinear form, and
3. show that the dot product satisfies $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.


We begin by constructing a dot product space containing the images of the input
patterns under $\Phi$. To this end, we first need to define a vector space. This is done
by taking linear combinations of the form

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i). \tag{2.22}$$

Here, $m \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$ and $x_1, \ldots, x_m \in \mathcal{X}$ are arbitrary. Next, we define a dot product
between $f$ and another function

$$g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j), \tag{2.23}$$

where $m' \in \mathbb{N}$, $\beta_j \in \mathbb{R}$ and $x'_1, \ldots, x'_{m'} \in \mathcal{X}$, as

$$\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j). \tag{2.24}$$


This expression explicitly contains the expansion coefficients, which need not be
unique. To see that it is nevertheless well-defined, note that

$$\langle f, g \rangle = \sum_{j=1}^{m'} \beta_j f(x'_j), \tag{2.25}$$

using $k(x'_j, x_i) = k(x_i, x'_j)$. The sum in (2.25), however, does not depend on the
particular expansion of $f$. Similarly, for $g$, note that

$$\langle f, g \rangle = \sum_{i=1}^{m} \alpha_i g(x_i). \tag{2.26}$$

The last two equations also show that $\langle \cdot, \cdot \rangle$ is bilinear. It is symmetric, as
$\langle f, g \rangle = \langle g, f \rangle$. Moreover, it is positive definite, since positive definiteness of $k$ implies that
for any function $f$, written as (2.22), we have

$$\langle f, f \rangle = \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \geq 0. \tag{2.27}$$


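The latter implies that $\langle \cdot, \cdot \rangle$ is actually itself a positive definite kernel, defined
on our space of functions. To see this, note that given functions $f_1, \ldots, f_n$, and
coefficients $\gamma_1, \ldots, \gamma_n \in \mathbb{R}$, we have

$$\sum_{i,j=1}^{n} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^{n} \gamma_i f_i, \sum_{j=1}^{n} \gamma_j f_j \Big\rangle \geq 0. \tag{2.28}$$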


Here, the left hand equality follows from the bilinearity of $\langle \cdot, \cdot \rangle$, and the right hand
inequality from (2.27). For the last step in proving that it qualifies as a dot product,
we will use the following interesting property of $\Phi$, which follows directly from
the definition: for all functions (2.22), we have

$$\langle k(\cdot, x), f \rangle = f(x); \tag{2.29}$$

that is, $k$ is the representer of evaluation. In particular,

$$\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'). \tag{2.30}$$

By virtue of these properties, positive definite kernels $k$ are also called reproducing
kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have

$$|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle. \tag{2.31}$$
Therefore, $\langle f, f \rangle = 0$ directly implies $f = 0$, which is the last property that required
proof in order to establish that $\langle \cdot, \cdot \rangle$ is a dot product (cf. Section B.2).
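The construction can be mirrored in a few lines of code. In the sketch below, a function $f = \sum_i \alpha_i k(\cdot, x_i)$ is stored as a list of (coefficient, point) pairs, which is our own representation choice, and the kernel is a Gaussian on scalar patterns. The code checks that the dot product of two expansions can be computed from function values alone, that $k(\cdot, x)$ reproduces $f(x)$, and that the resulting bound on $|f(x)|^2$ holds for one example.

```python
from math import exp, isclose


def k(x, x_prime, sigma=1.0):
    """Gaussian kernel on scalar patterns."""
    return exp(-((x - x_prime) ** 2) / (2.0 * sigma**2))


def evaluate(f, x):
    """f(x) for f given as a list of (alpha_i, x_i) pairs."""
    return sum(alpha * k(x, xi) for alpha, xi in f)


def inner(f, g):
    """<f, g> = sum_i sum_j alpha_i * beta_j * k(x_i, x'_j)."""
    return sum(a * b * k(xi, xj) for a, xi in f for b, xj in g)


f = [(0.7, -1.0), (-0.3, 0.5), (1.1, 2.0)]
g = [(0.4, 0.0), (-0.9, 1.5)]

# The dot product expressed through the values of f, and through those of g.
via_f_values = sum(beta * evaluate(f, xj) for beta, xj in g)
via_g_values = sum(alpha * evaluate(g, xi) for alpha, xi in f)

# Reproducing property: <k(., x), f> = f(x).
x = 0.25
reproduced = inner([(1.0, x)], f)

# Bound |f(x)|^2 <= k(x, x) * <f, f>, with a small tolerance for rounding.
bound_holds = evaluate(f, x) ** 2 <= k(x, x) * inner(f, f) + 1e-12

print(isclose(inner(f, g), via_f_values), isclose(inner(f, g), via_g_values))
print(isclose(reproduced, evaluate(f, x)), bound_holds)
```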

The case of complex-valued kernels can be dealt with using the same construction;
in that case, we will end up with a complex dot product space [42].

The above reasoning has shown that any positive definite kernel can be thought 
of as a dot product in another space: in view of (2.21), the reproducing kernel 
property (2.30) amounts to 


$$\langle \Phi(x), \Phi(x') \rangle = k(x, x'). \tag{2.32}$$


Therefore, the dot product space $\mathcal{H}$ constructed in this way is one possible instantiation
of the feature space associated with a kernel.

Above, we have started with the kernel, and constructed a feature map. Let us
now consider the opposite direction. Whenever we have a mapping $\Phi$ from $\mathcal{X}$ into
a dot product space, we obtain a positive definite kernel via $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$.
This can be seen by noting that for all $c_i \in \mathbb{R}$, $x_i \in \mathcal{X}$, $i = 1, \ldots, m$, we have

$$\sum_{i,j} c_i c_j k(x_i, x_j) = \Big\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \Big\rangle = \Big\| \sum_i c_i \Phi(x_i) \Big\|^2 \geq 0, \tag{2.33}$$

due to the nonnegativity of the norm.
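The argument can be traced numerically for any explicit feature map. The sketch below uses a hypothetical map `phi` from scalars into $\mathbb{R}^3$ (our own choice, purely for illustration), defines the kernel as the dot product of mapped patterns, and confirms that the quadratic form equals the squared norm of the weighted sum of feature vectors.

```python
from math import isclose


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Hypothetical feature map from scalars into R^3."""
    return (1.0, x, x * x)


def k(x, x_prime):
    """Kernel induced by phi."""
    return dot(phi(x), phi(x_prime))


patterns = [-1.0, 0.5, 2.0]
c = [0.3, -1.2, 0.7]
m = len(patterns)

quadratic_form = sum(
    c[i] * c[j] * k(patterns[i], patterns[j]) for i in range(m) for j in range(m)
)

# Weighted sum of the mapped patterns, computed coordinate by coordinate.
weighted_sum = [
    sum(c[i] * phi(patterns[i])[t] for i in range(m)) for t in range(3)
]
squared_norm = dot(weighted_sum, weighted_sum)

print(isclose(quadratic_form, squared_norm), quadratic_form >= 0.0)
```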

This has two consequences. First, it allows us to give an equivalent definition of 
positive definite kernels as functions with the property that there exists a map 
$\Phi$ into a dot product space such that (2.32) holds true. Second, it allows us to
construct kernels from feature maps. For instance, it is in this way that powerful 
linear representations of 3D heads proposed in computer graphics [575, 59] give 
rise to kernels. The identity (2.32) forms the basis for the kernel trick: 


Remark 2.8 (“Kernel Trick”) Given an algorithm which is formulated in terms of a
positive definite kernel $k$, one can construct an alternative algorithm by replacing $k$ by
another positive definite kernel $\tilde{k}$.


In view of the material in the present section, the justification for this procedure is
the following: effectively, the original algorithm can be thought of as a dot product
based algorithm operating on vectorial data $\Phi(x_1), \ldots, \Phi(x_m)$. The algorithm
obtained by replacing $k$ by $\tilde{k}$ then is exactly the same dot product based algorithm,
only that it operates on $\tilde{\Phi}(x_1), \ldots, \tilde{\Phi}(x_m)$.
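As a hedged illustration of this substitution, consider a simple nearest-mean classifier; the classifier is our own choice of example and is not discussed in this section. The squared distance between a mapped pattern and the mean of a class in feature space can be written purely in terms of kernel evaluations, so the same code runs with the linear kernel or with a Gaussian kernel. The data and all names below are hypothetical.

```python
from math import exp


def linear_kernel(x, x_prime):
    """Dot product in the (one-dimensional) input domain."""
    return x * x_prime


def gaussian_kernel(x, x_prime, sigma=1.0):
    return exp(-((x - x_prime) ** 2) / (2.0 * sigma**2))


def sq_dist_to_mean(k, x, points):
    """Squared feature-space distance between the image of x and the mean
    of the images of the given points, using kernel evaluations only."""
    m = len(points)
    cross = sum(k(x, p) for p in points) / m
    within = sum(k(p, q) for p in points for q in points) / (m * m)
    return k(x, x) - 2.0 * cross + within


def nearest_mean_label(k, x, classes):
    """Label of the class whose feature-space mean is closest to x."""
    return min(classes, key=lambda label: sq_dist_to_mean(k, x, classes[label]))


# Toy 1-D data: one class near the origin, the other on both sides of it.
classes = {"inner": [-0.5, 0.0, 0.5], "outer": [-3.0, 3.0]}

label_near_origin = nearest_mean_label(gaussian_kernel, 0.2, classes)
label_far_out = nearest_mean_label(gaussian_kernel, 2.8, classes)
linear_gap = sq_dist_to_mean(linear_kernel, 2.8, classes["inner"]) - sq_dist_to_mean(
    linear_kernel, 2.8, classes["outer"]
)

print(label_near_origin, label_far_out)  # inner outer
print(abs(linear_gap) < 1e-12)           # True
```

With the linear kernel, both class means coincide at the origin, so the two distances tie and the classes cannot be told apart; after swapping in the Gaussian kernel, the unchanged algorithm separates them.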

The best known application of the kernel trick is in the case where k is the dot 
product in the input domain (cf. Problem 2.5). The trick is not limited to that case, 
however: $k$ and $\tilde{k}$ can both be nonlinear kernels. In general, care must be exercised
in determining whether the resulting algorithm will be useful: sometimes, an 
algorithm will only work subject to additional conditions on the input data, e.g., 
the data set might have to lie in the positive orthant. We shall later see that certain 
kernels induce feature maps which enforce such properties for the mapped data 
(cf. (2.73)), and that there are algorithms which take advantage of these aspects 
(e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel 
will make sense.

Even though the kernel trick had been used in the literature for a fair amount of 
time [4, 62], it took until the mid 1990s before it was explicitly stated that any
algorithm that only depends on dot products, i.e., any algorithm that is rotationally
invariant, can be kernelized [479, 480]. Since then, a number of algorithms have 
benefitted from the kernel trick, such as the ones described in the present book, as 
well as methods for clustering in feature spaces [479, 215, 199]. 

Moreover, the machine learning community took time to comprehend that the 
definition of kernels on general sets (rather than dot product spaces) greatly 
extends the applicability of kernel methods [467], to data types such as texts and 
other sequences [234, 585, 23]. Indeed, this is now recognized as a crucial feature 
of kernels: they lead to an embedding of general data types in linear spaces. 

Not surprisingly, the history of methods for representing kernels in linear spaces 
(in other words, the mathematical counterpart of the kernel trick) dates back 
significantly further than their use in machine learning. The methods appear to 
have first been studied in the 1940s by Kolmogorov [304] for countable X and 
Aronszajn [16] in the general case. Pioneering work on linear representations of 
a related class of kernels, to be described in Section 2.4, was done by Schoenberg 
[465]. Further bibliographical comments can be found in [42]. 

We thus see that the mathematical basis for kernel algorithms has been around 
for a long time. As is often the case, however, the practical importance of mathematical
results was initially underestimated.²


2.2.3 Reproducing Kernel Hilbert Spaces 


In the last section, we described how to define a space of functions which is a 
valid realization of the feature spaces associated with a given kernel. To do this, 
we had to make sure that the space is a vector space, and that it is endowed with 
a dot product. Such spaces are referred to as dot product spaces (cf. Appendix B), 
or equivalently as pre-Hilbert spaces. The reason for the latter is that one can turn 
them into Hilbert spaces (cf. Section B.3) by a fairly simple mathematical trick. This 
additional structure has some mathematical advantages. For instance, in Hilbert 
spaces it is always possible to define projections. Indeed, Hilbert spaces are one of 
the favorite concepts of functional analysis. 

So let us again consider the pre-Hilbert space of functions (2.22), endowed with 
the dot product (2.24). To turn it into a Hilbert space (over R), one completes it in 
the norm corresponding to the dot product, $\|f\| := \sqrt{\langle f, f \rangle}$. This is done by adding
the limit points of sequences that are convergent in that norm (see Appendix B). 


2. This is illustrated by the following quotation from an excellent machine learning textbook
published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics
are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of
potential functions. However, these suggestions are more appealing for their mathematical beauty than
their practical usefulness.”
In view of the properties (2.29) and (2.30), this space is usually called a reproducing
kernel Hilbert space (RKHS). 
In general, an RKHS can be defined as follows. 


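Definition 2.9 (Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set (often
called the index set) and $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called
a reproducing kernel Hilbert space endowed with the dot product $\langle \cdot, \cdot \rangle$ (and the norm
$\|f\| := \sqrt{\langle f, f \rangle}$) if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following properties.

1. $k$ has the reproducing property³

$$\langle f, k(x, \cdot) \rangle = f(x) \text{ for all } f \in \mathcal{H}; \tag{2.34}$$

in particular,

$$\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x'). \tag{2.35}$$

2. $k$ spans $\mathcal{H}$, i.e. $\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where $\overline{X}$ denotes the completion of the set $X$
(cf. Appendix B).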


On a more abstract level, an RKHS can be defined as a Hilbert space of functions
$f$ on $\mathcal{X}$ such that all evaluation functionals (the maps $f \mapsto f(x')$, where $x' \in \mathcal{X}$) are
continuous. In that case, by the Riesz representation theorem (e.g., [429]), for each
$x' \in \mathcal{X}$ there exists a unique function of $x$, called $k(x, x')$, such that

$$f(x') = \langle f, k(\cdot, x') \rangle. \tag{2.36}$$


It follows directly from (2.35) that k(x, x’) is symmetric in its arguments (see 
Problem 2.28) and satisfies the conditions for positive definiteness. 

Note that the RKHS uniquely determines $k$. This can be shown by contradiction:
assume that there exist two kernels, say $k$ and $k'$, spanning the same RKHS $\mathcal{H}$.
From Problem 2.28 we know that both $k$ and $k'$ must be symmetric. Moreover,
from (2.34) we conclude that

$$\langle k(x, \cdot), k'(x', \cdot) \rangle_{\mathcal{H}} = k(x, x') = k'(x', x). \tag{2.37}$$

In the second equality we used the symmetry of the dot product. Finally, symmetry
in the arguments of $k$ yields $k(x, x') = k'(x, x')$ which proves our claim.


2.2.4 The Mercer Kernel Map 


Section 2.2.2 has shown that any positive definite kernel can be represented as a 
dot product in a linear space. This was done by explicitly constructing a (Hilbert) 
space that does the job. The present section will construct another Hilbert space. 


3. Note that this implies that each $f \in \mathcal{H}$ is actually a single function whose values at any
$x \in \mathcal{X}$ are well-defined. In contrast, $L_2$ Hilbert spaces usually do not have this property. The
elements of these spaces are equivalence classes of functions that disagree only on sets of
measure 0; cf. footnote 15 in Section B.3.
One could argue that this is superfluous, given that any two separable Hilbert
spaces are isometrically isomorphic, in other words, it is possible to define a
one-to-one linear map between the spaces which preserves the dot product. However,
the tool that we shall presently use, Mercer’s theorem, has played a crucial role 
in the understanding of SVMs, and it provides valuable insight into the geometry 
of feature spaces, which more than justifies its detailed discussion. In the SVM 
literature, the kernel trick is usually introduced via Mercer’s theorem. 

We start by stating the version of Mercer’s theorem given in [606]. We assume
$(\mathcal{X}, \mu)$ to be a finite measure space.⁴ The term almost all (cf. Appendix B) means
except for sets of measure zero. For the commonly used Lebesgue-Borel measure,
countable sets of individual points are examples of zero measure sets. Note that
the integral with respect to a measure is explained in Appendix B. Readers who
do not want to go into mathematical detail may simply want to think of the $d\mu(x')$
as a $dx'$, and of $\mathcal{X}$ as a compact subset of $\mathbb{R}^N$. For further explanations of the terms
involved in this theorem, cf. Appendix B, especially Section B.3.


Theorem 2.10 (Mercer [359, 307]) Suppose $k \in L_\infty(\mathcal{X}^2)$ is a symmetric real-valued
function such that the integral operator (cf. (2.16))

$$T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X}), \qquad (T_k f)(x) := \int_{\mathcal{X}} k(x, x')\, f(x')\, d\mu(x') \tag{2.38}$$

is positive definite; that is, for all $f \in L_2(\mathcal{X})$, we have

$$\int_{\mathcal{X}^2} k(x, x')\, f(x)\, f(x')\, d\mu(x)\, d\mu(x') \geq 0. \tag{2.39}$$

Let $\psi_j \in L_2(\mathcal{X})$ be the normalized orthogonal eigenfunctions of $T_k$ associated with the
eigenvalues $\lambda_j > 0$, sorted in non-increasing order. Then

1. $(\lambda_j)_j \in \ell_1$,

2. $k(x, x') = \sum_{j=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x) \psi_j(x')$ holds for almost all $(x, x')$. Either $N_{\mathcal{H}} \in \mathbb{N}$, or $N_{\mathcal{H}} = \infty$;
in the latter case, the series converges absolutely and uniformly for almost all $(x, x')$.


For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approx- 
imation and its relationship to kernel PCA (Section 1.7), see Problem 2.26. 
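From statement 2 it follows that $k(x, x')$ corresponds to a dot product in $\ell_2^{N_{\mathcal{H}}}$,
since $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ with

$$\Phi : \mathcal{X} \to \ell_2^{N_{\mathcal{H}}}, \qquad x \mapsto \left(\sqrt{\lambda_j}\, \psi_j(x)\right)_{j = 1, \ldots, N_{\mathcal{H}}}, \tag{2.40}$$

for almost all $x \in \mathcal{X}$. Note that we use the same $\Phi$ as in (2.21) to denote the feature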


4. A finite measure space is a set $\mathcal{X}$ with a $\sigma$-algebra (Definition B.1) defined on it, and a
measure (Definition B.2) defined on the latter, satisfying $\mu(\mathcal{X}) < \infty$ (so that, up to a scaling
factor, $\mu$ is a probability measure).

map, although the target spaces are different. However, this distinction is not
important for the present purposes — we are interested in the existence of some 
Hilbert space in which the kernel corresponds to the dot product, and not in what 
particular representation of it we are using. 

In fact, it has been noted [467] that the uniform convergence of the series implies
that given any $\epsilon > 0$, there exists an $n \in \mathbb{N}$ such that even if $N_{\mathcal{H}} = \infty$, $k$ can be
approximated within accuracy $\epsilon$ as a dot product in $\mathbb{R}^n$: for almost all $x, x' \in \mathcal{X}$,
$|k(x, x') - \langle \Phi^n(x), \Phi^n(x') \rangle| < \epsilon$, where $\Phi^n : x \mapsto (\sqrt{\lambda_1}\, \psi_1(x), \ldots, \sqrt{\lambda_n}\, \psi_n(x))$. The
feature space can thus always be thought of as finite-dimensional within some
accuracy $\epsilon$. We summarize our findings in the following proposition.


Proposition 2.11 (Mercer Kernel Map) If $k$ is a kernel satisfying the conditions of
Theorem 2.10, we can construct a mapping $\Phi$ into a space where $k$ acts as a dot product,

$$\langle \Phi(x), \Phi(x') \rangle = k(x, x'), \tag{2.41}$$

for almost all $x, x' \in \mathcal{X}$. Moreover, given any $\epsilon > 0$, there exists a map $\Phi^n$ into an $n$-dimensional
dot product space (where $n \in \mathbb{N}$ depends on $\epsilon$) such that

$$|k(x, x') - \langle \Phi^n(x), \Phi^n(x') \rangle| < \epsilon \tag{2.42}$$

for almost all $x, x' \in \mathcal{X}$.
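
The truncation statement (2.42) can be illustrated numerically. The sketch below is only a toy: it assumes an expansion on $\mathcal{X} = [0, 1]$ with eigenfunctions $\psi_j(x) = \sqrt{2}\cos(j\pi x)$ and eigenvalues $\lambda_j = 2^{-j}$ (both chosen for illustration, not taken from the text), and it uses a 60-term sum as a stand-in for the full series. Since $|\psi_j| \leq \sqrt{2}$, the error of the $n$-term map is bounded by $2\sum_{j>n} \lambda_j$.

```python
import math

# Assumed toy Mercer expansion on X = [0, 1] with Lebesgue measure:
# psi_j(x) = sqrt(2) cos(j pi x), lambda_j = 2^(-j). Illustrative choices only.
J = 60  # number of terms used as a stand-in for the full series


def eigenvalue(j):
    return 2.0 ** (-j)


def eigenfunction(j, x):
    return math.sqrt(2.0) * math.cos(j * math.pi * x)


def mercer_map(x, n):
    """Truncated Mercer map Phi^n(x) = (sqrt(lambda_j) psi_j(x)), j = 1..n."""
    return [math.sqrt(eigenvalue(j)) * eigenfunction(j, x) for j in range(1, n + 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, y):
    """k(x, y) = sum_j lambda_j psi_j(x) psi_j(y), summed up to J terms."""
    return dot(mercer_map(x, J), mercer_map(y, J))


def truncation_error(x, y, n):
    return abs(kernel(x, y) - dot(mercer_map(x, n), mercer_map(y, n)))


grid = [i / 20 for i in range(21)]
n = 10
worst_error = max(truncation_error(x, y, n) for x in grid for y in grid)
# |psi_j| <= sqrt(2), so the neglected tail is bounded by 2 * sum of eigenvalues.
tail_bound = 2.0 * sum(eigenvalue(j) for j in range(n + 1, J + 1))
```

With rapidly decaying eigenvalues, a handful of coordinates already reproduces the kernel to high accuracy, which is the sense in which the feature space is finite-dimensional "within some accuracy".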
Both Mercer kernels and positive definite kernels can thus be represented as dot
products in Hilbert spaces. The following proposition, showing a case where the
two types of kernels coincide, thus comes as no surprise.


Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let $\mathcal{X} = [a, b]$ be
a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is a positive definite
kernel if and only if

$$\int_a^b \int_a^b k(x, x')\, f(x)\, \overline{f(x')}\, dx\, dx' \geq 0 \tag{2.43}$$

for each continuous function $f : \mathcal{X} \to \mathbb{C}$.


Note that the conditions in this proposition are actually more restrictive than
those of Theorem 2.10. Using the feature space representation (Proposition 2.11),
however, it is easy to see that Mercer kernels are also positive definite (for almost
all $x, x' \in \mathcal{X}$) in the more general case of Theorem 2.10: given any $c \in \mathbb{R}^m$, we have

$$\sum_{i,j} c_i c_j k(x_i, x_j) = \sum_{i,j} c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle = \Big\| \sum_i c_i \Phi(x_i) \Big\|^2 \geq 0. \tag{2.44}$$


Being positive definite, Mercer kernels are thus also reproducing kernels. 

We next show how the reproducing kernel map is related to the Mercer kernel
map constructed from the eigenfunction decomposition [202, 467]. To this end, let
us consider a kernel which satisfies the condition of Theorem 2.10, and construct
a dot product $\langle \cdot, \cdot \rangle$ such that $k$ becomes a reproducing kernel for the Hilbert space
$\mathcal{H}$ containing the functions

$$f(x) = \sum_{i=1}^{\infty} \alpha_i k(x, x_i) = \sum_{i=1}^{\infty} \alpha_i \sum_{j=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x) \psi_j(x_i). \tag{2.45}$$

By linearity, which holds for any dot product, we have

$$\langle f, k(\cdot, x') \rangle = \sum_{i=1}^{\infty} \alpha_i \sum_{j,n=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x_i) \langle \psi_j, \psi_n \rangle \lambda_n \psi_n(x'). \tag{2.46}$$

Since $k$ is a Mercer kernel, the $\psi_i$ ($i = 1, \ldots, N_{\mathcal{H}}$) can be chosen to be orthogonal
with respect to the dot product in $L_2(\mathcal{X})$. Hence it is straightforward to choose $\langle \cdot, \cdot \rangle$
such that

$$\langle \psi_j, \psi_n \rangle = \delta_{jn} / \lambda_j \tag{2.47}$$

(using the Kronecker symbol $\delta_{jn}$, see (B.30)), in which case (2.46) reduces to the
reproducing kernel property (2.36) (using (2.45)). For a coordinate representation
in the RKHS, see Problem 2.29.

The above connection between the Mercer kernel map and the RKHS map is
instructive, but we shall rarely make use of it. In fact, we will usually identify
the different feature spaces. Thus, to avoid confusion in subsequent chapters, the
following comments are necessary. As described above, there are different ways
of constructing feature spaces for any given kernel. In fact, they can even differ in
terms of their dimensionality (cf. Problem 2.22). The two feature spaces that we
will mostly use in this book are the RKHS associated with $k$ (Section 2.2.2) and
the Mercer $\ell_2$ feature space. We will mostly use the same symbol $\mathcal{H}$ for all feature
spaces that are associated with a given kernel. This makes sense provided that
everything we do, at the end of the day, reduces to dot products. For instance, let
us assume that $\Phi_1, \Phi_2$ are maps into the feature spaces $\mathcal{H}_1, \mathcal{H}_2$ respectively, both
associated with the kernel $k$; in other words,

$$k(x, x') = \langle \Phi_i(x), \Phi_i(x') \rangle_{\mathcal{H}_i}, \quad \text{for } i = 1, 2. \tag{2.48}$$

Then it will usually not be the case that $\Phi_1(x) = \Phi_2(x)$; due to (2.48), however,
we always have $\langle \Phi_1(x), \Phi_1(x') \rangle_{\mathcal{H}_1} = \langle \Phi_2(x), \Phi_2(x') \rangle_{\mathcal{H}_2}$. Therefore, as long as we are
only interested in dot products, the two spaces can be considered identical.

An example of this identity is the so-called large margin regularizer that is
usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters
4 and 7),

$$\langle w, w \rangle, \quad \text{where } w = \sum_{i=1}^{m} \alpha_i \Phi(x_i). \tag{2.49}$$

No matter whether $\Phi$ is the RKHS map $\Phi(x_i) = k(\cdot, x_i)$ (2.21) or the Mercer map
$\Phi(x_i) = (\sqrt{\lambda_j}\, \psi_j(x_i))_{j = 1, \ldots, N_{\mathcal{H}}}$ (2.40), the value of $\|w\|^2$ will not change.
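
As a small numerical illustration of this point, the sketch below uses the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$ with two different explicit feature maps, into $\mathbb{R}^3$ and into $\mathbb{R}^4$ (in the spirit of the monomial maps of Section 2.1). The data points and coefficients are made up for the example; these maps are not the RKHS or Mercer maps themselves, but they show the same effect: the maps differ, yet $\|w\|^2$ agrees and equals $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$.

```python
import math


def k(x, y):
    """Homogeneous polynomial kernel of degree 2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi_1(x):
    """Feature map into R^3 (unordered monomials, with the sqrt(2) scaling)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]


def phi_2(x):
    """Feature map into R^4 (ordered monomials)."""
    return [x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, points, phi):
    """w = sum_i alpha_i Phi(x_i), cf. (2.49)."""
    w = [0.0] * len(phi(points[0]))
    for a, x in zip(alphas, points):
        for idx, value in enumerate(phi(x)):
            w[idx] += a * value
    return w


# Hypothetical data and expansion coefficients.
points = [(1.0, 2.0), (-0.5, 3.0), (2.0, -1.0)]
alphas = [0.5, -1.0, 2.0]

w1 = weight_vector(alphas, points, phi_1)
w2 = weight_vector(alphas, points, phi_2)
norm_sq_1 = dot(w1, w1)
norm_sq_2 = dot(w2, w2)
norm_sq_kernel = sum(
    ai * aj * k(xi, xj)
    for ai, xi in zip(alphas, points)
    for aj, xj in zip(alphas, points)
)
```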
This point is of great importance, and we hope that all readers are still with us.
It is fair to say, however, that Section 2.2.5 can be skipped at first reading.


2.2.5 The Shape of the Mapped Data in Feature Space


Using Mercer’s theorem, we have shown that one can think of the feature map 
as a map into a high- or infinite-dimensional Hilbert space. The argument in the 
remainder of the section shows that this typically entails that the mapped data 
$\Phi(\mathcal{X})$ lie in some box with rapidly decaying side lengths [606]. By this we mean
that the range of the data decreases as the dimension index j increases, with a rate 
that depends on the size of the eigenvalues. 

Let us assume that for all $j \in \mathbb{N}$, we have $\sup_{x \in \mathcal{X}} \lambda_j |\psi_j(x)|^2 < \infty$. Define the
sequence

$$l_j := \sup_{x \in \mathcal{X}} \lambda_j |\psi_j(x)|^2. \tag{2.50}$$

Note that if

$$C_k := \sup_j \sup_{x \in \mathcal{X}} |\psi_j(x)| \tag{2.51}$$

exists (see Problem 2.24), then we have $l_j \leq \lambda_j C_k^2$. However, if the $\lambda_j$ decay rapidly,
then (2.50) can be finite even if (2.51) is not.

By construction, $\Phi(\mathcal{X})$ is contained in an axis parallel parallelepiped in $\ell_2^{N_{\mathcal{H}}}$ with
side lengths $2\sqrt{l_j}$ (cf. (2.40)).⁵

Consider an example of a common kernel, the Gaussian, and let $\mu$ (see Theorem 2.10)
be the Lebesgue measure. In this case, the eigenvectors are sine and
cosine functions (with supremum one), and thus the sequence of the $l_j$ coincides
with the sequence of the eigenvalues $\lambda_j$. Generally, whenever $\sup_{x \in \mathcal{X}} |\psi_j(x)|$ is finite,
the $l_j$ decay as fast as the $\lambda_j$. We shall see in Sections 4.4, 4.5 and Chapter 12
that for many common kernels, this decay is very rapid.

It will be useful to consider operators that map $\Phi(\mathcal{X})$ into balls of some radius
$R$ centered at the origin. The following proposition characterizes a class of such
operators, determined by the sequence $(l_j)_{j \in \mathbb{N}}$. Recall that $\mathbb{R}^{\mathbb{N}}$ denotes the space of
all real sequences.


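Proposition 2.13 (Mapping $\Phi(\mathcal{X})$ into $\ell_2$) Let $S$ be the diagonal map

$$S : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}, \qquad (x_j)_j \mapsto S(x_j)_j = (s_j x_j)_j, \tag{2.52}$$

where $(s_j)_j \in \mathbb{R}^{\mathbb{N}}$. If $(s_j \sqrt{l_j})_j \in \ell_2$, then $S$ maps $\Phi(\mathcal{X})$ into a ball centered at the origin
whose radius is $R = \|(s_j \sqrt{l_j})_j\|_{\ell_2}$.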


5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent 
statements also only hold true almost everywhere. 


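Proof Suppose $(s_j \sqrt{l_j})_j \in \ell_2$. Using the Mercer map (2.40), we have

$$\|S\Phi(x)\|^2 = \sum_{j \in \mathbb{N}} s_j^2 \lambda_j |\psi_j(x)|^2 \leq \sum_{j \in \mathbb{N}} s_j^2 l_j = R^2 \tag{2.53}$$

for any $x \in \mathcal{X}$. Hence $S\Phi(\mathcal{X}) \subseteq \ell_2$. ∎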


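The converse is not necessarily the case. To see this, note that if $(s_j \sqrt{l_j})_j \notin \ell_2$,
amounting to saying that

$$\sum_j s_j^2 \sup_{x \in \mathcal{X}} \lambda_j |\psi_j(x)|^2 \tag{2.54}$$

is not finite, then there need not always exist an $x \in \mathcal{X}$ such that $S\Phi(x) =
(s_j \sqrt{\lambda_j}\, \psi_j(x))_j \notin \ell_2$, i.e., that

$$\sum_j s_j^2 \lambda_j |\psi_j(x)|^2 \tag{2.55}$$

is not finite.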

To see how the freedom to rescale $\Phi(\mathcal{X})$ effectively restricts the class of functions
we are using, we first note that everything in the feature space $\mathcal{H} = \ell_2^{N_{\mathcal{H}}}$ is done
in terms of dot products. Therefore, we can compensate any invertible symmetric
linear transformation of the data in $\mathcal{H}$ by the inverse transformation on the set
of admissible weight vectors in $\mathcal{H}$. In other words, for any invertible symmetric
operator $S$ on $\mathcal{H}$, we have $\langle S^{-1} w, S\Phi(x) \rangle = \langle w, \Phi(x) \rangle$ for all $x \in \mathcal{X}$.
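
Both the radius bound of Proposition 2.13 and the compensation identity above are easy to check numerically. The following sketch again assumes a toy expansion (eigenfunctions $\sqrt{2}\cos(j\pi x)$ on $[0, 1]$, eigenvalues $2^{-j}$, truncated at 20 terms) and a hypothetical diagonal scaling $s_j = 1.5^{j/2}$; none of these choices come from the text.

```python
import math

N = 20  # truncation level of the assumed toy expansion


def lam(j):
    return 2.0 ** (-j)


def psi(j, x):
    return math.sqrt(2.0) * math.cos(j * math.pi * x)


def phi(x):
    """Mercer-type map (sqrt(lambda_j) psi_j(x)), j = 1..N, cf. (2.40)."""
    return [math.sqrt(lam(j)) * psi(j, x) for j in range(1, N + 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def apply_diag(d, v):
    """Apply the diagonal operator with entries d to the vector v."""
    return [di * vi for di, vi in zip(d, v)]


# l_j = sup_x lambda_j |psi_j(x)|^2 = 2 lambda_j here (supremum attained at x = 0).
side_sq = [2.0 * lam(j) for j in range(1, N + 1)]
# Hypothetical scaling that inflates the later (short) sides of the box.
s = [1.5 ** (j / 2.0) for j in range(1, N + 1)]
s_inv = [1.0 / sj for sj in s]

# R^2 = sum_j s_j^2 l_j, cf. (2.53).
radius_sq = sum(sj ** 2 * lj for sj, lj in zip(s, side_sq))
grid = [i / 50 for i in range(51)]
max_norm_sq = max(dot(apply_diag(s, phi(x)), apply_diag(s, phi(x))) for x in grid)

# Compensation: <S^{-1} w, S Phi(x)> = <w, Phi(x)> for an arbitrary weight vector.
w = [(-1.0) ** j / j for j in range(1, N + 1)]
x0 = 0.3
lhs = dot(apply_diag(s_inv, w), apply_diag(s, phi(x0)))
rhs = dot(w, phi(x0))
```

The scaled data stay inside the ball of radius $R$, while the function values $\langle w, \Phi(x) \rangle$ are unchanged once the weight vector is rescaled by $S^{-1}$.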

As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there
exists a class of generalization error bounds that depend on the radius $R$ of the
smallest sphere containing the data. If the $(l_j)_j$ decay rapidly, we are not actually
“making use” of the whole sphere. In this case, we may construct a diagonal
scaling operator $S$ which inflates the sides of the above parallelepiped as much
as possible, while ensuring that it is still contained within a sphere of the original
radius $R$ in $\mathcal{H}$ (Figure 2.3). By effectively reducing the size of the function class, this
will provide a way of strengthening the bounds. A similar idea, using kernel PCA
(Section 14.2) to determine empirical scaling coefficients, has been successfully
applied by [101].

We conclude this section with another useful insight that characterizes a property
of the feature map $\Phi$. Note that most of what was said so far applies to the
case where the input domain $\mathcal{X}$ is a general set. In this case, it is not possible to
make nontrivial statements about continuity properties of $\Phi$. This changes if we
assume $\mathcal{X}$ to be endowed with a notion of closeness, by turning it into a so-called
topological space. Readers not familiar with this concept will be reassured to hear
that Euclidean vector spaces are particular cases of topological spaces.


Proposition 2.14 (Continuity of the Feature Map [402]) If $\mathcal{X}$ is a topological space
and $k$ is a continuous positive definite kernel on $\mathcal{X} \times \mathcal{X}$, then there exists a Hilbert space
$\mathcal{H}$ and a continuous map $\Phi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, we have $k(x, x') =
\langle \Phi(x), \Phi(x') \rangle$.


Figure 2.3 Since everything is done in terms of dot products, scaling up the data by
an operator $S$ can be compensated by scaling the weight vectors with $S^{-1}$ (cf. text). By
choosing $S$ such that the data are still contained in a ball of the same radius $R$, we effectively
reduce our function class (parametrized by the weight vector), which can lead to better
generalization bounds, depending on the kernel inducing the map $\Phi$.


2.2.6 The Empirical Kernel Map 


The map $\Phi$, defined in (2.21), transforms each input pattern into a function on $\mathcal{X}$,
that is, into a potentially infinite-dimensional object. For any given set of points,
however, it is possible to approximate $\Phi$ by only evaluating it on these points (cf.
[232, 350, 361, 547, 474]):


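Definition 2.15 (Empirical Kernel Map) For a given set $\{z_1, \ldots, z_n\} \subset \mathcal{X}$, $n \in \mathbb{N}$, we
call

$$\Phi_n : \mathbb{R}^N \to \mathbb{R}^n \quad \text{where} \quad x \mapsto k(\cdot, x)|_{\{z_1, \ldots, z_n\}} = (k(z_1, x), \ldots, k(z_n, x))^\top \tag{2.56}$$

the empirical kernel map w.r.t. $\{z_1, \ldots, z_n\}$.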


As an example, consider first the case where $k$ is a positive definite kernel, and
$\{z_1, \ldots, z_n\} = \{x_1, \ldots, x_m\}$; we thus evaluate $k(\cdot, x)$ on the training patterns. If we
carry out a linear algorithm in feature space, then everything will take place in
the linear span of the mapped training patterns. Therefore, we can represent the
$k(\cdot, x)$ of (2.21) as $\Phi_m(x)$ without losing information. The dot product to use in that
representation, however, is not simply the canonical dot product in $\mathbb{R}^m$, since the
$\Phi(x_i)$ will usually not form an orthonormal system. To turn $\Phi_m$ into a feature map
associated with $k$, we need to endow $\mathbb{R}^m$ with a dot product $\langle \cdot, \cdot \rangle_m$ such that

$$k(x, x') = \langle \Phi_m(x), \Phi_m(x') \rangle_m. \tag{2.57}$$

To this end, we use the ansatz $\langle \cdot, \cdot \rangle_m = \langle \cdot, M \cdot \rangle$, with $M$ being a positive definite
matrix.⁶ Enforcing (2.57) on the training patterns, this yields the self-consistency
condition [478, 512]

$$K = KMK, \tag{2.58}$$


6. Every dot product in $\mathbb{R}^m$ can be written in this form. We do not require strict definiteness
of M, as the null space can be projected out, leading to a lower-dimensional feature space. 


where $K$ is the Gram matrix. The condition (2.58) can be satisfied for instance
by the (pseudo-)inverse $M = K^{-1}$. Equivalently, we could have incorporated this
rescaling operation, which corresponds to a Kernel PCA “whitening” ([478, 547,
474], cf. Section 11.4), directly into the map, by whitening (2.56) to get

$$\Phi_m^w : x \mapsto K^{-1/2} (k(x_1, x), \ldots, k(x_m, x))^\top. \tag{2.59}$$

This simply amounts to dividing the eigenvector basis vectors of $K$ by $\sqrt{\lambda_i}$, where
the $\lambda_i$ are the eigenvalues of $K$.⁷ This parallels the rescaling of the eigenfunctions
of the integral operator belonging to the kernel, given by (2.47). It turns out that
this map can equivalently be performed using kernel PCA feature extraction (see
Problem 14.8), which is why we refer to this map as the kernel PCA map.

Note that we have thus constructed a data-dependent feature map into an $m$-dimensional
space which satisfies $\langle \Phi_m^w(x), \Phi_m^w(x') \rangle = k(x, x')$, i.e., we have found an
$m$-dimensional feature space associated with the given kernel. In the case where $K$
is invertible, $\Phi_m^w(x)$ computes the coordinates of $\Phi(x)$ when represented in a basis
of the $m$-dimensional subspace spanned by $\Phi(x_1), \ldots, \Phi(x_m)$.
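
A minimal sketch of the empirical kernel map and a whitening step follows. It deliberately swaps in a Cholesky factor for the symmetric square root: with $K = LL^\top$, the map $x \mapsto L^{-1}\Phi_m(x)$ realizes the same dot product $\langle \cdot, K^{-1} \cdot \rangle$ as (2.59) and differs from $K^{-1/2}\Phi_m(x)$ only by an orthogonal transformation. The Gaussian kernel, the sample points, and all function names are assumptions made for the example, and $K$ is assumed to be strictly positive definite.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def empirical_kernel_map(x, points, k):
    """Phi_m(x) = (k(x_1, x), ..., k(x_m, x)), cf. (2.56)."""
    return [k(z, x) for z in points]


def cholesky(K):
    """Lower-triangular L with K = L L^T (K must be strictly positive definite)."""
    m = len(K)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            acc = K[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][p] * y[p] for p in range(i))) / L[i][i])
    return y


def whitened_map(x, points, k, L):
    """L^{-1} Phi_m(x): canonical dot products now equal <u, K^{-1} v>."""
    return forward_solve(L, empirical_kernel_map(x, points, k))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical training patterns.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
K = [[gaussian_kernel(p, q) for q in points] for p in points]
L = cholesky(K)
mapped = [whitened_map(x, points, gaussian_kernel, L) for x in points]
```

On the training patterns, the canonical dot products of the whitened vectors reproduce the Gram matrix, which is exactly the self-consistency condition (2.58) with $M = K^{-1}$.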

For data sets where the number of examples is smaller than their dimension,
it can actually be computationally attractive to carry out $\Phi_m^w$ explicitly, rather
than using kernels in subsequent algorithms. Moreover, algorithms which are not
readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.

We end this section with two notes which illustrate why the use of (2.56) need 
not be restricted to the special case we just discussed. 


• More general kernels. When using non-symmetric kernels $k$ in (2.56), together with
the canonical dot product, we effectively work with the positive definite matrix
$K^\top K$. Note that each positive definite matrix can be written as $K^\top K$. Therefore,
working with positive definite kernels leads to an equally rich set of nonlinearities
as working with an empirical kernel map using general non-symmetric kernels.
If we wanted to carry out the whitening step, we would have to use $(K^\top K)^{-1/4}$ (cf.
footnote 7 concerning potential singularities).


• Different evaluation sets. Things can be sped up by using expansion sets of the
form $\{z_1, \ldots, z_n\}$, mapping into an $n$-dimensional space, with $n < m$, as done in
[100, 228]. In that case, one modifies (2.59) to

$$\Phi_n^w : x \mapsto K_n^{-1/2} (k(z_1, x), \ldots, k(z_n, x))^\top, \tag{2.60}$$

where $(K_n)_{ij} := k(z_i, z_j)$. The expansion set can either be a subset of the training
set,⁸ or some other set of points. We will later return to the issue of how to choose


7. It is understood that if $K$ is singular, we use the pseudo-inverse of $K^{1/2}$, in which case we
get an even lower dimensional subspace.

8. In [228] it is recommended that the size $n$ of the expansion set is chosen large enough to
ensure that the smallest eigenvalue of $K_n$ is larger than some predetermined $\epsilon > 0$. Alternatively,
one can start off with a larger set, and use kernel PCA to select the most important
components for the map, see Problem 14.8. In the kernel PCA case, the map (2.60) is com-
the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of
Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry
about the whitening step in (2.59) and (2.60): using the canonical dot product in
$\mathbb{R}^m$ (rather than $\langle \cdot, \cdot \rangle_m$) will simply lead to diagonalizing $K^2$ instead of $K$, which
yields the same eigenvectors with squared eigenvalues. This was pointed out by
[350, 361]. The study [361] reports experiments where (2.56) was employed to
speed up Kernel PCA by choosing $\{z_1, \ldots, z_n\}$ as a subset of $\{x_1, \ldots, x_m\}$.


2.2.7 A Kernel Map Defined from Pairwise Similarities 


In practice, we are given a finite amount of data $x_1, \ldots, x_m$. The following simple
observation shows that even if we do not want to (or are unable to) analyze a given
kernel $k$ analytically, we can still compute a map $\Phi$ such that $k$ corresponds to a
dot product in the linear span of the $\Phi(x_i)$:


Proposition 2.16 (Data-Dependent Kernel Map [467]) Suppose the data $x_1, \ldots, x_m$
and the kernel $k$ are such that the kernel Gram matrix $K_{ij} = k(x_i, x_j)$ is positive definite.
Then it is possible to construct a map $\Phi$ into an $m$-dimensional feature space $\mathcal{H}$ such that

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle. \tag{2.61}$$

Conversely, given an arbitrary map $\Phi$ into some feature space $\mathcal{H}$, the matrix $K_{ij} =
\langle \Phi(x_i), \Phi(x_j) \rangle$ is positive definite.


Proof First assume that $K$ is positive definite. In this case, it can be diagonalized
as $K = SDS^\top$, with an orthogonal matrix $S$ and a diagonal matrix $D$ with nonnegative
entries. Then

$$K_{ij} = (SDS^\top)_{ij} = \langle S_i, D S_j \rangle = \langle \sqrt{D}\, S_i, \sqrt{D}\, S_j \rangle, \tag{2.62}$$

where we have defined the $S_i$ as the rows of $S$ (note that the columns of $S$ would be
$K$’s eigenvectors). Therefore, $K$ is the Gram matrix of the vectors $\sqrt{D}\, S_i$, that is, of the
vectors with entries $\sqrt{D_{jj}}\, S_{ij}$.⁹ Hence the following map $\Phi$, defined on $x_1, \ldots, x_m$,
will satisfy (2.61):

$$\Phi : x_i \mapsto \sqrt{D}\, S_i. \tag{2.63}$$
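
As a worked illustration of the construction (the $2 \times 2$ matrix is chosen here purely as an example), take a Gram matrix with eigenvalues 3 and 1. Writing the rows of $S$ as $S_1, S_2$ and scaling their entries by the square roots of the eigenvalues gives two vectors whose dot products reproduce $K$:

```latex
\[
K = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
  = S D S^\top, \qquad
S = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
D = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix},
\]
\[
\Phi(x_1) = \sqrt{D}\, S_1 = \Big( \sqrt{\tfrac{3}{2}},\; \tfrac{1}{\sqrt{2}} \Big), \qquad
\Phi(x_2) = \sqrt{D}\, S_2 = \Big( \sqrt{\tfrac{3}{2}},\; -\tfrac{1}{\sqrt{2}} \Big),
\]
\[
\langle \Phi(x_1), \Phi(x_1) \rangle = \tfrac{3}{2} + \tfrac{1}{2} = 2, \qquad
\langle \Phi(x_1), \Phi(x_2) \rangle = \tfrac{3}{2} - \tfrac{1}{2} = 1, \qquad
\langle \Phi(x_2), \Phi(x_2) \rangle = \tfrac{3}{2} + \tfrac{1}{2} = 2.
\]
```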


Thus far, $\Phi$ is only defined on a set of points, rather than on a vector space.
Therefore, it makes no sense to ask whether it is linear. We can, however, ask
whether it can be extended to a linear map, provided the $x_i$ are elements of a vector
space. The answer is that if the $x_i$ are linearly dependent (which is often the case),
then this will not be possible, since a linear map would then typically be over-

puted as $D_n^{-1/2} U_n^\top (k(z_1, x), \ldots, k(z_n, x))$, where $U_n D_n U_n^\top$ is the eigenvalue decomposition of
$K_n$. Note that the columns of $U_n$ are the eigenvectors of $K_n$. We discard all columns that
correspond to zero eigenvalues, as well as the corresponding dimensions of $D_n$. To approximate
the map, we may actually discard all eigenvalues smaller than some $\epsilon > 0$.

9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46]. 
determined by the $m$ conditions (2.63).
For the converse, assume an arbitrary $\alpha \in \mathbb{R}^m$, and compute

$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K_{ij} = \Big\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \sum_{j=1}^{m} \alpha_j \Phi(x_j) \Big\rangle = \Big\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \Big\|^2 \geq 0. \tag{2.64}$$

∎
In particular, this result implies that given data $x_1, \ldots, x_m$, and a kernel $k$ which
gives rise to a positive definite matrix $K$, it is always possible to construct a feature
space $\mathcal{H}$ of dimension at most $m$ that we are implicitly working in when using
kernels (cf. Problem 2.32 and Section 2.2.6).

If we perform an algorithm which requires $k$ to correspond to a dot product in
some other space (as for instance the SV algorithms described in this book), it is
possible that even though $k$ is not positive definite in general, it still gives rise to
a positive definite Gram matrix $K$ with respect to the training data at hand. In this
case, Proposition 2.16 tells us that nothing will go wrong during training when we
work with these data. Moreover, if $k$ leads to a matrix with some small negative
eigenvalues, we can add a small multiple of some strictly positive definite kernel
$k'$ (such as the identity $k'(x_i, x_j) = \delta_{ij}$) to obtain a positive definite matrix. To see
this, suppose that $\lambda_{\min} < 0$ is the minimal eigenvalue of $k$’s Gram matrix. Note that
being strictly positive definite, the Gram matrix $K'$ of $k'$ satisfies

$$\min_{\|\alpha\| = 1} \langle \alpha, K' \alpha \rangle \geq \lambda'_{\min} > 0, \tag{2.65}$$

where $\lambda'_{\min}$ denotes its minimal eigenvalue, and the first inequality follows from
Rayleigh’s principle (B.57). Therefore, provided that $\lambda_{\min} + \lambda \lambda'_{\min} \geq 0$, we have

$$\langle \alpha, (K + \lambda K') \alpha \rangle = \langle \alpha, K \alpha \rangle + \lambda \langle \alpha, K' \alpha \rangle \geq \|\alpha\|^2 (\lambda_{\min} + \lambda \lambda'_{\min}) \geq 0 \tag{2.66}$$

for all $\alpha \in \mathbb{R}^m$, rendering $(K + \lambda K')$ positive definite.
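
The repair step (2.66) can be tried on a tiny example. The sketch below builds a $2 \times 2$ Gram matrix from the sigmoid kernel (2.69) on two one-dimensional points, which happens to be indefinite, and shifts it by a multiple of the identity. The points, the parameter values, and the closed-form $2 \times 2$ eigenvalue helper are choices made for this illustration only.

```python
import math


def sigmoid_kernel(x, y, kappa=1.0, theta=-1.0):
    """tanh(kappa <x, y> + theta) for scalar inputs, cf. (2.69)."""
    return math.tanh(kappa * x * y + theta)


def min_eigenvalue_2x2(M):
    """Smallest eigenvalue of a symmetric 2x2 matrix, in closed form."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return mean - radius


def quadratic_form(M, alpha):
    return sum(alpha[i] * M[i][j] * alpha[j] for i in range(2) for j in range(2))


# Hypothetical data: two scalar patterns.
points = [0.0, 2.0]
K = [[sigmoid_kernel(x, y) for y in points] for x in points]
lam_min = min_eigenvalue_2x2(K)  # negative for this data

# K' = identity has minimal eigenvalue 1, so lambda = -lam_min satisfies
# lam_min + lambda * 1 >= 0, the condition preceding (2.66).
shift = -lam_min
K_shifted = [[K[i][j] + (shift if i == j else 0.0) for j in range(2)] for i in range(2)]
lam_min_shifted = min_eigenvalue_2x2(K_shifted)

test_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0), (0.3, -2.0)]
forms = [quadratic_form(K_shifted, a) for a in test_vectors]
```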


For the following examples, let us assume that $\mathcal{X} \subset \mathbb{R}^N$. Besides homogeneous
polynomial kernels (cf. Proposition 2.1),

$$k(x, x') = \langle x, x' \rangle^d, \tag{2.67}$$

Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis
function kernels [26, 4],

$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), \tag{2.68}$$

where $\sigma > 0$, and sigmoid kernels,

$$k(x, x') = \tanh(\kappa \langle x, x' \rangle + \vartheta), \tag{2.69}$$

where $\kappa > 0$ and $\vartheta < 0$. By applying Theorem 13.4 below, one can check that the
latter kernel is not actually positive definite (see Section 4.6 and [85, 511] and the
discussion in Example 4.25). Curiously, it has nevertheless successfully been used
in practice. The reasons for this are discussed in [467].

Other useful kernels include the inhomogeneous polynomial,

$$k(x, x') = (\langle x, x' \rangle + c)^d, \tag{2.70}$$

($d \in \mathbb{N}$, $c \geq 0$) and the $B_n$-spline kernel [501, 572] ($I_X$ denoting the indicator (or
characteristic) function on the set $X$, and $\otimes$ the convolution operation,
$(f \otimes g)(x) := \int f(x')\, g(x' - x)\, dx'$),

$$k(x, x') = B_{2p+1}(\|x - x'\|) \quad \text{with} \quad B_n := \bigotimes_{i=1}^{n} I_{[-\frac{1}{2}, \frac{1}{2}]}. \tag{2.71}$$

The kernel computes B-splines of order $2p + 1$ ($p \in \mathbb{N}$), defined by the $(2p + 1)$-fold
convolution of the unit interval $[-1/2, 1/2]$. See Section 4.4.1 for further details
and a regularization theoretic analysis of this kernel.

Note that all these kernels have the convenient property of unitary invariance,
$k(x, x') = k(Ux, Ux')$ if $U^\top = U^{-1}$, for instance if $U$ is a rotation. If we consider complex
numbers, then we have to use the adjoint $U^* := \overline{U}^\top$ instead of the transpose.

Radial basis function (RBF) kernels are kernels that can be written in the form

$$k(x, x') = f(d(x, x')), \tag{2.72}$$

where $d$ is a metric on $\mathcal{X}$, and $f$ is a function on $\mathbb{R}_0^+$. Examples thereof are the
Gaussians and B-splines mentioned above. Usually, the metric arises from the
dot product; $d(x, x') = \|x - x'\| = \sqrt{\langle x - x', x - x' \rangle}$. In this case, RBF kernels are
unitary invariant, too. In addition, they are translation invariant; in other words,
$k(x, x') = k(x + x_0, x' + x_0)$ for all $x_0 \in \mathcal{X}$.
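
The kernels (2.67) to (2.70) and the two invariance properties can be sketched in a few lines. The parameter defaults, the test points, and the rotation angle below are arbitrary choices for illustration; the B-spline kernel is omitted for brevity.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, d=3):
    """Homogeneous polynomial kernel, cf. (2.67)."""
    return dot(x, y) ** d


def gaussian_kernel(x, y, sigma=1.5):
    """Gaussian RBF kernel, cf. (2.68)."""
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-dot(diff, diff) / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    """Sigmoid kernel, cf. (2.69)."""
    return math.tanh(kappa * dot(x, y) + theta)


def inhomogeneous_poly_kernel(x, y, d=3, c=1.0):
    """Inhomogeneous polynomial kernel, cf. (2.70)."""
    return (dot(x, y) + c) ** d


def rotate(x, angle):
    """Rotation in the plane, an orthogonal map with U^T = U^{-1}."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


def translate(x, x0):
    return tuple(a + b for a, b in zip(x, x0))


kernels = [poly_kernel, gaussian_kernel, sigmoid_kernel, inhomogeneous_poly_kernel]
x, y = (1.0, -2.0), (0.5, 3.0)
angle, offset = 0.7, (1.0, 1.0)

rotation_gaps = [abs(k(rotate(x, angle), rotate(y, angle)) - k(x, y)) for k in kernels]
gaussian_translation_gap = abs(
    gaussian_kernel(translate(x, offset), translate(y, offset)) - gaussian_kernel(x, y)
)
poly_translation_gap = abs(
    poly_kernel(translate(x, offset), translate(y, offset)) - poly_kernel(x, y)
)
```

All four kernels are unchanged by the rotation, but only the Gaussian, being an RBF kernel, is also unchanged by the translation.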

In some cases, invariance properties alone can distinguish particular kernels: in
Section 2.1, we explained how using polynomial kernels $\langle x, x' \rangle^d$ corresponds to
mapping into a feature space whose dimensions are spanned by all possible $d$th
order monomials in input coordinates. The different dimensions are scaled with
the square root of the number of ordered products of the respective $d$ entries (e.g.,
$\sqrt{2}$ in (2.13)). These scaling factors precisely ensure invariance under the group
of all orthogonal transformations (rotations and mirroring operations). In many
cases, this is a desirable property: it ensures that the results of a learning procedure
do not depend on which orthonormal coordinate system (with fixed origin) we use
for representing our input data.


Proposition 2.17 (Invariance of Polynomial Kernels [480]) Up to a scaling factor,
the kernel $k(x, x') = \langle x, x' \rangle^d$ is the only kernel inducing a map into a space of all monomials
of degree $d$ which is invariant under orthogonal transformations of $\mathbb{R}^N$.


Some interesting additional structure exists in the case of a Gaussian RBF kernel $k$
(2.68). As $k(x, x) = 1$ for all $x \in \mathcal{X}$, each mapped example has unit length, $\|\Phi(x)\| = 1$
(Problem 2.18 shows how to achieve this for general kernels). Moreover, as
$k(x, x') > 0$ for all $x, x' \in \mathcal{X}$, all points lie inside the same orthant in feature space. To
see this, recall that for unit length vectors, the dot product (1.3) equals the cosine
of the enclosed angle. We obtain

$$\cos(\angle(\Phi(x), \Phi(x'))) = \langle \Phi(x), \Phi(x') \rangle = k(x, x') > 0, \tag{2.73}$$

which amounts to saying that the enclosed angle between any two mapped examples
is smaller than $\pi/2$.

The above seems to indicate that in the Gaussian case, the mapped data lie in 
a fairly restricted area of feature space. However, in another sense, they occupy a 
space which is as large as possible: 


Theorem 2.18 (Full Rank of Gaussian RBF Gram Matrices [360]) Suppose that
$x_1, \ldots, x_m \subset \mathcal{X}$ are distinct points, and $\sigma \neq 0$. The matrix $K$ given by

$$K_{ij} := \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \tag{2.74}$$

has full rank.


In other words, the points $\Phi(x_1), \ldots, \Phi(x_m)$ are linearly independent (provided
no two $x_i$ are the same). They span an $m$-dimensional subspace of $\mathcal{H}$. Therefore
a Gaussian kernel defined on a domain of infinite cardinality, with no a priori
restriction on the number of training examples, produces a feature space of infinite
dimension. Nevertheless, an analysis of the shape of the mapped data in feature
space shows that capacity is distributed in a way that ensures smooth and simple
estimates whenever possible (see Section 12.4).
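
Both observations about the Gaussian case (unit length with pairwise angles below $\pi/2$, and full rank of the Gram matrix) can be checked on a small sample. The four points and the hand-written determinant routine below are assumptions of this sketch; a nonzero determinant is used here as a simple full-rank test.

```python
import math


def gaussian_gram(points, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. (2.74)."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2.0 * sigma ** 2))
            for q in points
        ]
        for p in points
    ]


def determinant(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det


# Hypothetical distinct points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
K = gaussian_gram(points, sigma=1.0)
det_K = determinant(K)

# Enclosed angles between mapped examples, cf. (2.73); K_ij is the cosine.
angles = [
    math.acos(K[i][j])
    for i in range(len(points))
    for j in range(len(points))
    if i != j
]
```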

The examples given above all apply to the case of vectorial data. Let us next give 
an example where X is not a vector space [42]. 


Proposition 2.19 (Similarity of Probabilistic Events) If $(\mathcal{X}, \mathcal{C}, P)$ is a probability
space with $\sigma$-algebra $\mathcal{C}$ and probability measure $P$, then

$$k(A, B) = P(A \cap B) - P(A)P(B) \tag{2.75}$$

is a positive definite kernel on $\mathcal{C} \times \mathcal{C}$.

Proof To see this, we define a feature map

$$\Phi : A \mapsto (I_A - P(A)), \tag{2.76}$$

where $I_A$ is the characteristic function on $A$. On the feature space, which consists
of functions on $\mathcal{X}$ taking values in $[-1, 1]$, we use the dot product

$$\langle f, g \rangle := \int f g \, dP. \tag{2.77}$$

The result follows by noticing $\langle I_A, I_B \rangle = P(A \cap B)$ and $\langle I_A, P(B) \rangle = P(A)P(B)$. ∎
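
The proof can be replayed on a finite probability space, where the integral (2.77) is a finite sum. The sketch below assumes a fair six-sided die as the probability space and uses exact rational arithmetic; the events are chosen only for illustration.

```python
from fractions import Fraction

# Assumed toy probability space: a fair six-sided die.
OUTCOMES = range(1, 7)
P_POINT = {omega: Fraction(1, 6) for omega in OUTCOMES}


def prob(event):
    return sum((P_POINT[omega] for omega in event), Fraction(0))


def kernel(A, B):
    """k(A, B) = P(A and B) - P(A) P(B), cf. (2.75)."""
    return prob(A & B) - prob(A) * prob(B)


def feature(A):
    """Phi(A) = I_A - P(A), stored as a function table on the outcomes, cf. (2.76)."""
    p = prob(A)
    return {omega: (1 if omega in A else 0) - p for omega in OUTCOMES}


def dot(f, g):
    """<f, g> = integral of f g dP, a finite sum here, cf. (2.77)."""
    return sum((f[omega] * g[omega] * P_POINT[omega] for omega in OUTCOMES), Fraction(0))


even = frozenset({2, 4, 6})
high = frozenset({4, 5, 6})
low_pair = frozenset({1, 2})

k_even_high = kernel(even, high)
dot_even_high = dot(feature(even), feature(high))
k_independent = kernel(even, low_pair)
```

The kernel value is the covariance of the two indicator functions, so it vanishes for independent events such as `even` and `low_pair`.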

Further examples include kernels for string matching, as proposed by [585, 234,
23]. We shall describe these, and address the general problem of designing kernel 
functions, in Chapter 13. 

The next section will return to the connection between kernels and feature 
spaces. Readers who are eager to move on to SV algorithms may want to skip 
this section, which is somewhat more technical. 


2.4 The Representation of Dissimilarities in Linear Spaces 


2.4.1 Conditionally Positive Definite Kernels 


We now proceed to a larger class of kernels than that of the positive definite ones. 
This larger class is interesting in several regards. First, it will turn out that some 
kernel algorithms work with this class, rather than only with positive definite 
kernels. Second, its relationship to positive definite kernels is a rather interesting 
one, and a number of connections between the two classes provide understanding 
of kernels in general. Third, they are intimately related to a question which is 
a variation on the central aspect of positive definite kernels: the latter can be 
thought of as dot products in feature spaces; the former, on the other hand, can 
be embedded as distance measures arising from norms in feature spaces. 

The present section thus attempts to extend the utility of the kernel trick by 
looking at the problem of which kernels can be used to compute distances in 
feature spaces. The underlying mathematical results have been known for quite 
a while [465]; some of them have already attracted interest in the kernel methods 
community in various contexts [515, 234]. 

Clearly, the squared distance $\|\Phi(x) - \Phi(x')\|^2$ in the feature space associated with
a pd kernel $k$ can be computed, using $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, as

$$\|\Phi(x) - \Phi(x')\|^2 = k(x, x) + k(x', x') - 2k(x, x'). \tag{2.78}$$
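
Equation (2.78) translates directly into code. The sketch below compares the kernel-based distance with the distance computed from an explicit feature map for the degree-2 polynomial kernel on $\mathbb{R}^2$, and also evaluates it for a Gaussian kernel, where it reduces to $2 - 2k(x, x')$. The points and function names are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    return dot(x, y) ** 2


def gaussian_kernel(x, y, sigma=1.0):
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-dot(diff, diff) / (2.0 * sigma ** 2))


def feature_distance_sq(k, x, y):
    """||Phi(x) - Phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y), cf. (2.78)."""
    return k(x, x) + k(y, y) - 2.0 * k(x, y)


def poly2_map(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]


x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = feature_distance_sq(poly2_kernel, x, y)
diff = [a - b for a, b in zip(poly2_map(x), poly2_map(y))]
via_map = dot(diff, diff)
gaussian_dist_sq = feature_distance_sq(gaussian_kernel, x, y)
```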


Positive definite kernels are, however, not the full story: there exists a larger class 
of kernels that can be used as generalized distances, and the present section will 
describe why and how [468]. 

Let us start by considering how a dot product and the corresponding distance
measure are affected by a translation of the data, $x \mapsto x - x_0$. Clearly, $\|x - x'\|^2$ is
translation invariant while $\langle x, x' \rangle$ is not. A short calculation shows that the effect
of the translation can be expressed in terms of $\|\cdot - \cdot\|^2$ as

$$\langle (x - x_0), (x' - x_0) \rangle = \frac{1}{2} \left( -\|x - x'\|^2 + \|x - x_0\|^2 + \|x_0 - x'\|^2 \right). \tag{2.79}$$

Note that this, just like $\langle x, x' \rangle$, is still a pd kernel: $\sum_{i,j} c_i c_j \langle (x_i - x_0), (x_j - x_0) \rangle =
\|\sum_i c_i (x_i - x_0)\|^2 \geq 0$ holds true for any $c_i$. For any choice of $x_0 \in \mathcal{X}$, we thus get a
similarity measure (2.79) associated with the dissimilarity measure $\|x - x'\|$.
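
The short calculation behind (2.79) can be spelled out as follows; it only uses bilinearity and symmetry of the (real) dot product:

```latex
\begin{aligned}
\|x - x'\|^2 &= \|(x - x_0) - (x' - x_0)\|^2 \\
             &= \|x - x_0\|^2 + \|x' - x_0\|^2 - 2 \langle x - x_0,\, x' - x_0 \rangle, \\
\text{hence} \quad \langle x - x_0,\, x' - x_0 \rangle
             &= \tfrac{1}{2} \left( -\|x - x'\|^2 + \|x - x_0\|^2 + \|x_0 - x'\|^2 \right).
\end{aligned}
```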

This naturally leads to the question of whether (2.79) might suggest a connection
that also holds true in more general cases: what kind of nonlinear dissimilarity
measure do we have to substitute for $\|\cdot - \cdot\|^2$ on the right hand side of (2.79), to
ensure that the left hand side becomes positive definite? To state the answer, we
first need to define the appropriate class of kernels.

The following definition differs from Definition 2.4 only in the additional constraint
on the sum of the $c_i$. Below, $\mathbb{K}$ is a shorthand for $\mathbb{C}$ or $\mathbb{R}$; the definitions are
the same in both cases.


Definition 2.20 (Conditionally Positive Definite Matrix) A symmetric $m \times m$ matrix
$K$ ($m \geq 2$) taking values in $\mathbb{K}$ and satisfying

$$\sum_{i,j=1}^{m} c_i \bar{c}_j K_{ij} \geq 0 \quad \text{for all } c_i \in \mathbb{K}, \text{ with } \sum_{i=1}^{m} c_i = 0, \tag{2.80}$$

is called conditionally positive definite (cpd).


Definition 2.21 (Conditionally Positive Definite Kernel) Let $\mathcal{X}$ be a nonempty set.
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{K}$ which for all $m \geq 2$, $x_1, \ldots, x_m \in \mathcal{X}$ gives rise to a conditionally
positive definite Gram matrix is called a conditionally positive definite (cpd) kernel.


Note that symmetry is also required in the complex case. Due to the additional
constraint on the coefficients $c_i$, it does not follow automatically anymore, as it
did in the case of complex positive definite matrices and kernels. In Chapter 4, we
will revisit cpd kernels. There, we will actually introduce cpd kernels of different
orders. The definition given in the current chapter covers the case of kernels which
are cpd of order 1.


Proposition 2.22 (Constructing PD Kernels from CPD Kernels [42]) Let $x_0 \in \mathcal{X}$,
and let $k$ be a symmetric kernel on $\mathcal{X} \times \mathcal{X}$. Then

$$\tilde{k}(x, x') := \frac{1}{2} \left( k(x, x') - k(x, x_0) - k(x_0, x') + k(x_0, x_0) \right)$$

is positive definite if and only if $k$ is conditionally positive definite.


The proof follows directly from the definitions and can be found in [42]. This 
result does generalize (2.79): the negative squared distance kernel is indeed cpd, 
since ¥;c; = 0 implies — X; ;¢;c,||xi — xjl? = -Eci £ cll? — Ej c E cilli? + 
25; jcicj (xi xj) = 25; tie; (xi xj) = 2||Xicixi||? > 0. In fact, this implies that all 
kernels of the form 


$$k(x, x') = -\|x - x'\|^{\beta}, \quad 0 < \beta \le 2$$ (2.81)


are cpd (they are not pd),¹⁰ by application of the following result (note that the
case $\beta = 0$ is trivial):


Proposition 2.23 (Fractional Powers and Logs of CPD Kernels [42]) If
$k : \mathcal{X} \times \mathcal{X} \to (-\infty, 0]$ is cpd, then so are $-(-k)^{\alpha}$
($0 < \alpha < 1$) and $-\ln(1 - k)$.


10. Moreover, they are not cpd if $\beta > 2$ [42].
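
The role of the exponent in (2.81) can be made concrete with a tiny hand-checkable case. The sketch below evaluates the constrained quadratic form of (2.80) for three equally spaced points on the real line and one coefficient vector summing to zero. The names `neg_dist_power` and `constrained_form`, the points, and the coefficients are assumptions chosen for illustration; a single coefficient vector can refute the cpd property but cannot establish it.

```python
import math


def neg_dist_power(beta):
    """Return k(x, x') = -||x - x'||**beta for a fixed exponent beta > 0."""
    def k(x, y):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        return -(dist ** beta)
    return k


def constrained_form(kernel, points, coeffs):
    """Quadratic form of (2.80) for real coefficients that sum to zero."""
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("coefficients must sum to zero")
    return sum(ci * cj * kernel(xi, xj)
               for ci, xi in zip(coeffs, points)
               for cj, xj in zip(coeffs, points))


points = [(0.0,), (1.0,), (2.0,)]   # three equally spaced points on the line
coeffs = [1.0, -2.0, 1.0]           # second difference, sums to zero

# For these inputs the form equals 8 - 2**(beta + 1).
forms = {beta: constrained_form(neg_dist_power(beta), points, coeffs)
         for beta in (1.0, 2.0, 3.0)}
```

The form is positive for $\beta = 1$, exactly zero for $\beta = 2$, and negative for $\beta = 3$, consistent with the statement that these kernels are not cpd once $\beta > 2$.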


To state another class of cpd kernels that are not pd, note first that as a trivial
consequence of Definition 2.20, we know that (i) sums of cpd kernels are cpd, and
(ii) any constant $b \in \mathbb{R}$ is a cpd kernel. Therefore, any kernel of the form $k + b$,
where $k$ is cpd and $b \in \mathbb{R}$, is also cpd. In particular, since pd kernels are cpd, we
can take any pd kernel and offset it by $b$, and it will still be at least cpd. For further
examples of cpd kernels, cf. [42, 578, 205, 515].


2.4.2 Hilbert Space Representation of CPD Kernels 


We now return to the main flow of the argument. Proposition 2.22 allows us to
construct the feature map for $k$ from that of the pd kernel $\tilde{k}$. To this end, fix
$x_0 \in \mathcal{X}$ and define $\tilde{k}$ according to Proposition 2.22. Due to Proposition 2.22,
$\tilde{k}$ is positive definite. Therefore, we may employ the Hilbert space representation
$\Phi : \mathcal{X} \to \mathcal{H}$ of $\tilde{k}$ (cf. (2.32)), satisfying
$\langle \Phi(x), \Phi(x') \rangle = \tilde{k}(x, x')$; hence,

$$\|\Phi(x) - \Phi(x')\|^2 = \tilde{k}(x, x) + \tilde{k}(x', x') - 2\tilde{k}(x, x').$$ (2.82)

Substituting Proposition 2.22 yields

$$\|\Phi(x) - \Phi(x')\|^2 = -k(x, x') + \frac{1}{2}\left(k(x, x) + k(x', x')\right).$$ (2.83)


This implies the following result [465, 42]. 


Proposition 2.24 (Hilbert Space Representation of CPD Kernels) Let $k$ be a real-valued
CPD kernel on $\mathcal{X}$, satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$. Then there exists a Hilbert
space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$, and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, such that

$$\|\Phi(x) - \Phi(x')\|^2 = -k(x, x').$$ (2.84)

If we drop the assumption $k(x, x) = 0$, the Hilbert space representation reads

$$\|\Phi(x) - \Phi(x')\|^2 = -k(x, x') + \frac{1}{2}\left(k(x, x) + k(x', x')\right).$$ (2.85)


It can be shown that if $k(x, x) = 0$ for all $x \in \mathcal{X}$, then

$$d(x, x') := \sqrt{-k(x, x')} = \|\Phi(x) - \Phi(x')\|$$ (2.86)

is a semi-metric: clearly, it is nonnegative and symmetric; additionally, it satisfies
the triangle inequality, as can be seen by computing
$d(x, x') + d(x', x'') = \|\Phi(x) - \Phi(x')\| + \|\Phi(x') - \Phi(x'')\| \ge \|\Phi(x) - \Phi(x'')\| = d(x, x'')$ [42].

It is a metric if $k(x, x') \ne 0$ for $x \ne x'$. We thus see that we can rightly think of $k$
as the negative of a distance measure.
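
To see the semi-metric of (2.86) at work, the sketch below takes the kernel $k(x, x') = -\|x - x'\|$ (the case $\beta = 1$ of (2.81), which vanishes on the diagonal) and searches a small random sample for violations of the triangle inequality. The sample, the seed, and the function names are assumptions for illustration only; a search over finitely many triples is a sanity check, not a proof.

```python
import itertools
import math
import random


def k(x, y):
    """cpd kernel k(x, x') = -||x - x'|| (beta = 1 in (2.81)); k(x, x) = 0."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d(x, y):
    """Semi-metric of (2.86): d(x, x') = sqrt(-k(x, x'))."""
    return math.sqrt(-k(x, y))


rng = random.Random(1)  # fixed seed so the sketch is reproducible
points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(2)) for _ in range(8)]

# Largest value of d(x, x'') - (d(x, x') + d(x', x'')) over all ordered
# triples of distinct sample points; a positive value would be a violation.
worst_violation = max(
    d(x, z) - (d(x, y) + d(y, z))
    for x, y, z in itertools.permutations(points, 3)
)
```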

We next show how to represent general symmetric kernels (thus in particular
cpd kernels) as symmetric bilinear forms $Q$ in feature spaces. This generalization
of the previously known feature space representation for pd kernels comes at a
cost: $Q$ will no longer be a dot product. For our purposes, we can get away with
this. The result will give us an intuitive understanding of Proposition 2.22: we
can then write $\tilde{k}$ as $\tilde{k}(x, x') := Q(\Phi(x) - \Phi(x_0), \Phi(x') - \Phi(x_0))$.
Proposition 2.22 thus essentially adds an origin in feature space which corresponds
to the image $\Phi(x_0)$ of one point $x_0$ under the feature map.


Proposition 2.25 (Vector Space Representation of Symmetric Kernels) Let $k$ be a
real-valued symmetric kernel on $\mathcal{X}$. Then there exists a linear space $\mathcal{H}$ of real-valued
functions on $\mathcal{X}$, endowed with a symmetric bilinear form $Q(\cdot, \cdot)$, and a mapping
$\Phi : \mathcal{X} \to \mathcal{H}$, such that $k(x, x') = Q(\Phi(x), \Phi(x'))$.


Proof The proof is a direct modification of the pd case. We use the map (2.21) and
linearly complete the image as in (2.22). Define
$Q(f, g) := \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$. To
see that it is well-defined, although it explicitly contains the expansion coefficients
(which need not be unique), note that $Q(f, g) = \sum_{j=1}^{m'} \beta_j f(x'_j)$, independent of the
$\alpha_i$. Similarly, for $g$, note that $Q(f, g) = \sum_i \alpha_i g(x_i)$, hence it is independent of
the $\beta_j$. The last two equations also show that $Q$ is bilinear; clearly, it is symmetric. ∎


Note, moreover, that by definition of Q, k is a reproducing kernel for the fea- 
ture space (which is not a Hilbert space): for all functions f (2.22), we have 
Q(k(., x), f) = f(x); in particular, Q(k(., x), k(.,x’)) = k(x, x’). 

Rewriting $\tilde{k}$ as $\tilde{k}(x, x') := Q(\Phi(x) - \Phi(x_0), \Phi(x') - \Phi(x_0))$ suggests an immediate
generalization of Proposition 2.22: in practice, we might want to choose other
points as origins in feature space, namely points that do not have a pre-image $x_0$ in
the input domain, such as the mean of a set of points (cf. [543]). This will be useful
when considering kernel PCA. It is only crucial that the behavior of our reference
point under translation is identical to that of individual points. This is taken care
of by the constraint on the sum of the $c_i$ in the following proposition.


Proposition 2.26 (Exercise 2.23 in [42]) Let $K$ be a symmetric matrix, $e \in \mathbb{R}^m$ be the
vector of all ones, $\mathbf{1}$ the $m \times m$ identity matrix, and let $c \in \mathbb{C}^m$ satisfy $e^* c = 1$. Then

$$\tilde{K} := (\mathbf{1} - e c^*) K (\mathbf{1} - c e^*)$$ (2.87)

is positive definite if and only if $K$ is conditionally positive definite.¹¹

Proof "$\Rightarrow$": suppose $\tilde{K}$ is positive definite. Thus for any $a \in \mathbb{C}^m$ which satisfies
$a^* e = e^* a = 0$, we have
$0 \le a^* \tilde{K} a = a^* K a + a^* e c^* K c e^* a - a^* K c e^* a - a^* e c^* K a = a^* K a$.
This means that $0 \le a^* K a$, proving that $K$ is conditionally positive definite.


"$\Leftarrow$": suppose $K$ is conditionally positive definite. This means that we have to
show that $a^* \tilde{K} a \ge 0$ for all $a \in \mathbb{C}^m$. We have

$$a^* \tilde{K} a = a^* (\mathbf{1} - e c^*) K (\mathbf{1} - c e^*) a = s^* K s \quad \text{for } s = (\mathbf{1} - c e^*) a.$$ (2.88)

All we need to show is $e^* s = 0$, since then we can use the fact that $K$ is cpd to
obtain $s^* K s \ge 0$. This can be seen as follows:
$e^* s = e^* (\mathbf{1} - c e^*) a = (e^* - (e^* c) e^*) a = (e^* - e^*) a = 0$. ∎


11. $c^*$ is the vector obtained by transposing and taking the complex conjugate of $c$.
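
The construction (2.87) can be tried out numerically for real coefficients. The sketch below forms the Gram matrix of the negative squared distance kernel on a few random points, applies (2.87) with uniform weights $c_i = 1/m$ (so that the implied origin is the sample mean), and checks two consequences: sampled quadratic forms are nonnegative, and the resulting matrix annihilates $c$. All helper names, the sample, and the restriction to real vectors are assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def recenter(K, c):
    """Return (1 - e c^T) K (1 - c e^T) for a real vector c with sum(c) == 1."""
    m = len(K)
    # P = 1 - c e^T has entries delta_ij - c_i; its transpose is 1 - e c^T.
    P = [[(1.0 if i == j else 0.0) - c[i] for j in range(m)] for i in range(m)]
    return matmul(matmul(transpose(P), K), P)


def quad(M, a):
    """Quadratic form a^T M a."""
    n = len(a)
    return sum(a[i] * M[i][j] * a[j] for i in range(n) for j in range(n))


rng = random.Random(2)  # fixed seed so the sketch is reproducible
points = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(5)]
m = len(points)

# Gram matrix of the cpd kernel k(x, x') = -||x - x'||^2.
K = [[-sum((a - b) ** 2 for a, b in zip(x, y)) for y in points] for x in points]

c = [1.0 / m] * m          # uniform weights: the origin is the sample mean
K_tilde = recenter(K, c)

min_quad = min(quad(K_tilde, [rng.gauss(0.0, 1.0) for _ in range(m)])
               for _ in range(200))
# K_tilde maps c to zero, because (1 - c e^T) c = c - c (e^T c) = 0.
K_tilde_c = [sum(K_tilde[i][j] * c[j] for j in range(m)) for i in range(m)]
```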


This result directly implies a corresponding generalization of Proposition 2.22: 


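Proposition 2.27 (Adding a General Origin) Let $k$ be a symmetric kernel,
$x_1, \ldots, x_m \in \mathcal{X}$, and let $c_i \in \mathbb{C}$ satisfy $\sum_{i=1}^{m} c_i = 1$. Then

$$\tilde{k}(x, x') := \frac{1}{2}\left(k(x, x') - \sum_{i=1}^{m} c_i k(x, x_i) - \sum_{i=1}^{m} c_i k(x_i, x') + \sum_{i,j=1}^{m} c_i c_j k(x_i, x_j)\right)$$

is positive definite if and only if $k$ is conditionally positive definite.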


Proof Consider a set of $m' \in \mathbb{N}$ points $x'_1, \ldots, x'_{m'} \in \mathcal{X}$, and let $K$ be the
$(m + m') \times (m + m')$ Gram matrix based on $x_1, \ldots, x_m, x'_1, \ldots, x'_{m'}$. Apply
Proposition 2.26 using $c_{m+1} = \cdots = c_{m+m'} = 0$. ∎


The above results show that conditionally positive definite kernels are a natural 
choice whenever we are dealing with a translation invariant problem, such as the 
SVM: maximization of the margin of separation between two classes of data is 
independent of the position of the origin. Seen in this light, it is not surprising that 
the structure of the dual optimization problem (cf. [561]) allows cpd kernels: as
noted in [515, 507], the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$ projects out the same subspace as
(2.80) in the definition of conditionally positive definite matrices.

Another example of a kernel algorithm that works with conditionally positive 
definite kernels is Kernel PCA (Chapter 14), where the data are centered, thus 
removing the dependence on the origin in feature space. Formally, this follows
from Proposition 2.26 for $c_i = 1/m$.

Let us consider another example. One of the simplest distance-based classification
algorithms proceeds as follows. Given $m_+$ points labelled with $+1$, $m_-$ points
labelled with $-1$, and a mapped test point $\Phi(x)$, we compute the mean squared
distances between the latter and the two classes, and assign it to the one for which
this mean is smaller:

$$y = \operatorname{sgn}\left(\frac{1}{m_-} \sum_{y_i = -1} \|\Phi(x) - \Phi(x_i)\|^2 - \frac{1}{m_+} \sum_{y_i = +1} \|\Phi(x) - \Phi(x_i)\|^2\right).$$ (2.89)

We use the distance kernel trick (Proposition 2.24) to express the decision function
as a kernel expansion in the input domain: a short calculation shows that

$$y = \operatorname{sgn}\left(\frac{1}{m_+} \sum_{y_i = +1} k(x, x_i) - \frac{1}{m_-} \sum_{y_i = -1} k(x, x_i) + b\right),$$ (2.90)

with the constant offset

$$b = \frac{1}{2 m_-} \sum_{y_i = -1} k(x_i, x_i) - \frac{1}{2 m_+} \sum_{y_i = +1} k(x_i, x_i).$$ (2.91)

Note that for some cpd kernels, such as (2.81), $k(x_i, x_i)$ is always 0, and thus $b = 0$.
For others, such as the commonly used Gaussian kernel, $k(x_i, x_i)$ is a nonzero
constant, in which case $b$ vanishes provided that $m_+ = m_-$. For normalized Gaussians,
the resulting decision boundary can be interpreted as the Bayes decision based on
two Parzen window density estimates of the classes; for general cpd kernels, the
analogy is merely a formal one; that is, the decision functions take the same form.
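
The equivalence between the distance rule (2.89) and the kernel expansion (2.90) with offset (2.91) can be checked on toy data. The sketch below uses the negative squared distance kernel, for which the feature space distance in (2.84) coincides with the Euclidean distance in input space, so both rules can be evaluated directly. The data, the function names, and the convention of mapping a zero score to $-1$ are assumptions for illustration.

```python
def neg_sq_dist(x, y):
    """cpd kernel k(x, x') = -||x - x'||^2, with k(x, x) = 0."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))


def offset_b(k, pos, neg):
    """Constant offset (2.91)."""
    return (sum(k(x, x) for x in neg) / (2 * len(neg))
            - sum(k(x, x) for x in pos) / (2 * len(pos)))


def decide_kernel(k, pos, neg, x):
    """Decision function (2.90); a zero score is mapped to -1 (arbitrary)."""
    score = (sum(k(x, xi) for xi in pos) / len(pos)
             - sum(k(x, xi) for xi in neg) / len(neg)
             + offset_b(k, pos, neg))
    return 1 if score > 0 else -1


def decide_distance(pos, neg, x):
    """Decision rule (2.89) with explicit squared Euclidean distances."""
    def mean_sq_dist(cls):
        return sum(sum((a - b) ** 2 for a, b in zip(x, xi))
                   for xi in cls) / len(cls)
    return 1 if mean_sq_dist(neg) - mean_sq_dist(pos) > 0 else -1


pos = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]     # points labelled +1
neg = [(-1.0, -1.0), (-2.0, 0.0)]              # points labelled -1
test_points = [(1.5, 1.0), (-1.0, -0.5), (0.0, 0.0)]

kernel_labels = [decide_kernel(neg_sq_dist, pos, neg, x) for x in test_points]
distance_labels = [decide_distance(pos, neg, x) for x in test_points]
```

Any other cpd kernel can be passed to `decide_kernel` in place of `neg_sq_dist`; the explicit distance rule is then no longer available in input space, which is the point of the kernel expansion.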

Many properties of positive definite kernels carry over to the more general case 
of conditionally positive definite kernels, such as Proposition 13.1. 

Using Proposition 2.22, one can prove an interesting connection between the 
two classes of kernels: 


Proposition 2.28 (Connection PD — CPD [465]) A kernel k is conditionally positive 
definite if and only if exp(tk) is positive definite for all t > 0. 
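
Proposition 2.28 can be probed on the same three-point example used for (2.81). The sketch below evaluates the quadratic form of $\exp(tk)$ for $k(x, x') = -|x - x'|^{\beta}$ on the real line, once with $\beta = 2$ (where $k$ is cpd and $\exp(tk)$ is a Gaussian kernel) and once with $\beta = 3$ (where $k$ is not cpd). The function name, the points, the coefficients, and the value of $t$ are assumptions for illustration; one coefficient vector can only show failure of positive definiteness, not establish it.

```python
import math


def exp_kernel_form(beta, t, points, coeffs):
    """Quadratic form sum_ij c_i c_j exp(t * k(x_i, x_j)), k = -|x - x'|**beta."""
    return sum(ci * cj * math.exp(-t * abs(xi - xj) ** beta)
               for ci, xi in zip(coeffs, points)
               for cj, xj in zip(coeffs, points))


points = [0.0, 1.0, 2.0]
coeffs = [1.0, -2.0, 1.0]
t = 0.1

# For these inputs the form equals 6 - 8*exp(-t) + 2*exp(-t * 2**beta).
form_gauss = exp_kernel_form(2.0, t, points, coeffs)  # k cpd, exp(tk) Gaussian
form_cubic = exp_kernel_form(3.0, t, points, coeffs)  # k not cpd (beta > 2)
```

The Gaussian case gives a positive value, while the cubic case gives a negative one, so $\exp(tk)$ fails to be positive definite for this $t$, in line with the proposition.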


Positive definite kernels of the form $\exp(tk)$ ($t > 0$) have the interesting property
that their $n$th root ($n \in \mathbb{N}$) is again a positive definite kernel. Such kernels are
called infinitely divisible. One can show that, disregarding some technicalities, the
logarithm of an infinitely divisible positive definite kernel mapping into $\mathbb{R}_0^+$ is a
conditionally positive definite kernel.


2.4.3 Higher Order CPD Kernels 


For the sake of completeness, we now present some material which is of interest to 
one section later in the book (Section 4.8), but not central for the present chapter. 
We follow [341, 204]. 


Definition 2.29 (Conditionally Positive Definite Functions of Order q) A continuous
function $h$, defined on $[0, \infty)$, is called conditionally positive definite (cpd) of order $q$
on $\mathbb{R}^N$ if for any distinct points $x_1, \ldots, x_m \in \mathbb{R}^N$, the quadratic form

$$\sum_{i,j=1}^{m} \alpha_i \alpha_j h(\|x_i - x_j\|^2)$$ (2.92)

is nonnegative, provided that the scalars $\alpha_1, \ldots, \alpha_m$ satisfy
$\sum_{i=1}^{m} \alpha_i p(x_i) = 0$ for all polynomials $p(\cdot)$ on $\mathbb{R}^N$ of degree lower than $q$.


Let $\Pi_q^N$ denote the space of polynomials of degree lower than $q$ on $\mathbb{R}^N$. By
definition, every cpd function $h$ of order $q$ generates a positive definite kernel for
SV expansions in the space of functions orthogonal to $\Pi_q^N$, by setting
$k(x, x') := h(\|x - x'\|^2)$.

There exists also an analogue to the positive definiteness of the integral operator
in the conditions of Mercer's theorem. In [157, 341] it is shown that for cpd
functions $h$ of order $q$, we have

$$\int h(\|x - x'\|^2) f(x) f(x') \, dx \, dx' \ge 0,$$ (2.93)

provided that the projection of $f$ onto $\Pi_q^N$ is zero.


Figure 2.4 Conditionally positive definite functions, as described in Table 2.1. Where
applicable, we set the free parameter $c$ to 1; $\beta$ is set to 2. Note that cpd kernels need not
be positive anywhere (e.g., the Multiquadric kernel).


Table 2.1 Examples of Conditionally Positive Definite Kernels. The fact that the exponential
kernel is pd (i.e., cpd of order 0) follows from (2.81) and Proposition 2.28.


Definition 2.30 (Completely Monotonic Functions) A function $h(x)$ is called completely
monotonic of order $q$ if

$$(-1)^n \frac{d^n}{dx^n} h(x) \ge 0 \quad \text{for all } x \in [0, \infty) \text{ and } n \ge q.$$ (2.94)


It can be shown [464, 465, 360] that a function $h(x^2)$ is conditionally positive
definite if and only if $h(x)$ is completely monotonic of the same order. This gives a
(sometimes simpler) criterion for checking whether a function is cpd or not.

If we use cpd kernels in learning algorithms, we must ensure orthogonality of
the estimate with respect to $\Pi_q^N$. This is usually done via constraints
$\sum_{i=1}^{m} \alpha_i p(x_i) = 0$ for all polynomials $p(\cdot)$ on $\mathbb{R}^N$ of degree lower
than $q$ (see Section 4.8).


2.5 Summary


The crucial ingredient of SVMs and other kernel methods is the so-called kernel 
trick (see (2.7) and Remark 2.8), which permits the computation of dot products in 
high-dimensional feature spaces, using simple functions defined on pairs of input 
patterns. This trick allows the formulation of nonlinear variants of any algorithm 
that can be cast in terms of dot products, SVMs being but the most prominent ex- 
ample. The mathematical result underlying the kernel trick is almost a century old 
[359]. Nevertheless, it was only much later that it was exploited by the machine 
learning community for the analysis [4] and construction of algorithms [62], and 
that it was described as a general method for constructing nonlinear generaliza- 
tions of dot product algorithms [480]. 

The present chapter has reviewed the mathematical theory of kernels. We 
started with the class of polynomial kernels, which can be motivated as com- 
puting a combinatorially large number of monomial features rather efficiently. 
This led to the general question of which kernel can be used, or: which kernel
can be represented as a dot product in a linear feature space. We defined this class
and discussed some of its properties. We described several ways in which, given such a
kernel, one can construct a representation in a feature space. The most well-known
representation employs Mercer's theorem, and represents the feature space as an
$\ell_2$ space defined in terms of the eigenfunctions of an integral operator associated
with the kernel. An alternative representation uses elements of the theory of re- 
producing kernel Hilbert spaces, and yields additional insights, representing the 
linear space as a space of functions written as kernel expansions. We gave an in- 
depth discussion of the kernel trick in its general form, including the case where 
we are interested in dissimilarities rather than similarities; that is, when we want 
to come up with nonlinear generalizations of distance-based algorithms rather 
than dot-product-based algorithms. 

In both cases, the underlying philosophy is the same: we are trying to express a 
complex nonlinear algorithm in terms of simple geometrical concepts, and we are 
then dealing with it in a linear space. This linear space may not always be readily 
available; in some cases, it may even be hard to construct explicitly. Nevertheless, 
for the sake of design and analysis of the algorithms, it is sufficient to know that 
the linear space exists, empowering us to use the full potential of geometry, linear 
algebra and functional analysis. 


2.6 Problems


2.1 (Monomial Features in $\mathbb{R}^2$ •) Verify the second equality in (2.9).


2.2 (Multiplicity of Monomial Features in $\mathbb{R}^N$ [515] ••) Consider the monomial
kernel $k(x, x') = \langle x, x' \rangle^d$ (where $x, x' \in \mathbb{R}^N$), generating monomial features of
order $d$. Prove that a valid feature map for this kernel can be defined coordinate-wise as

$$\Phi_m(x) = \sqrt{\frac{d!}{\prod_{i=1}^{N} [m]_i!}} \prod_{i=1}^{N} [x]_i^{[m]_i}$$ (2.95)

for every $m \in \mathbb{N}^N$, $\sum_{i=1}^{N} [m]_i = d$ (i.e., every such $m$ corresponds to one dimension of $\mathcal{H}$).


2.3 (Inhomogeneous Polynomial Kernel ••) Prove that the kernel (2.70) induces a
feature map into the space of all monomials up to degree $d$. Discuss the role of $c$.


2.4 (Eigenvalue Criterion of Positive Definiteness •) Prove that a symmetric matrix
is positive definite if and only if all its eigenvalues are nonnegative (see Appendix B).


2.5 (Dot Products are Kernels •) Prove that dot products (Definition B.7) are positive
definite kernels.


2.6 (Kernels on Finite Domains ••) Prove that for finite $\mathcal{X}$, say $\mathcal{X} = \{x_1, \ldots, x_m\}$, $k$ is
a kernel if and only if the $m \times m$ matrix $(k(x_i, x_j))_{ij}$ is positive definite.


2.7 (Positivity on the Diagonal •) From Definition 2.5, prove that a kernel satisfies
$k(x, x) \ge 0$ for all $x \in \mathcal{X}$.


2.8 (Cauchy-Schwarz for Kernels ••) Give an elementary proof of Proposition 2.7.
Hint: start with the general form of a symmetric $2 \times 2$ matrix, and derive conditions for
its coefficients that ensure that it is positive definite.


2.9 (PD Kernels Vanishing on the Diagonal •) Use Proposition 2.7 to prove that a
kernel satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$ is identically zero.
How does the RKHS look in this case? Hint: use (2.31).


2.10 (Two Kinds of Positivity •) Give an example of a kernel which is positive definite
according to Definition 2.5, but not positive in the sense that $k(x, x') \ge 0$ for all $x, x'$.
Give an example of a kernel where the contrary is the case.


2.11 (General Coordinate Transformations •) Prove that if $\sigma : \mathcal{X} \to \mathcal{X}$ is a bijection,
and $k(x, x')$ is a kernel, then $k(\sigma(x), \sigma(x'))$ is a kernel, too.


2.12 (Positivity on the Diagonal •) Prove that positive definite kernels are positive on
the diagonal, $k(x, x) \ge 0$ for all $x \in \mathcal{X}$. Hint: use $m = 1$ in (2.15).


2.13 (Symmetry of Complex Kernels ••) Prove that complex-valued positive definite
kernels are symmetric (2.18).


2.14 (Real Kernels vs. Complex Kernels •) Prove that a real matrix satisfies (2.15) for
all $c_i \in \mathbb{C}$ if and only if it is symmetric and it satisfies (2.15) for real coefficients $c_i$.
Hint: decompose each $c_i$ in (2.15) into real and imaginary parts.


2.15 (Rank-One Kernels •) Prove that if $f$ is a real-valued function on $\mathcal{X}$, then
$k(x, x') := f(x) f(x')$ is a positive definite kernel.


2.16 (Bayes Kernel ••) Consider a binary pattern recognition problem. Specialize the
last problem to the case where $f : \mathcal{X} \to \{\pm 1\}$ equals the Bayes decision function $y(x)$,
i.e., the classification with minimal risk subject to an underlying distribution $P(x, y)$
generating the data.

Argue that this kernel is particularly suitable since it renders the problem linearly
separable in a 1D feature space: State a decision function (cf. (1.35)) that solves the problem
(hint: you just need one parameter $\alpha$, and you may set it to 1; moreover, use $b = 0$) [124].

The final part of the problem requires knowledge of Chapter 16: Consider now the
situation where some prior $P(f)$ over the target function class is given. What would the
optimal kernel be in this case? Discuss the connection to Gaussian processes.


2.17 (Inhomogeneous Polynomials •) Prove that the inhomogeneous polynomial (2.70)
is a positive definite kernel, e.g., by showing that it is a linear combination of homogeneous
polynomial kernels with positive coefficients. What kind of features does this kernel
compute [561]?


2.18 (Normalization in Feature Space •) Given a kernel $k$, construct a corresponding
normalized kernel $\tilde{k}$ by normalizing the feature map $\tilde{\Phi}$ such that for all
$x \in \mathcal{X}$, $\|\tilde{\Phi}(x)\| = 1$ (cf. also Definition 12.35). Discuss the relationship between
normalization in input space and normalization in feature space for Gaussian kernels and
homogeneous polynomial kernels.


2.19 (Cosine Kernel •) Suppose $\mathcal{X}$ is a dot product space, and $x, x' \in \mathcal{X}$. Prove that
$k(x, x') = \cos(\angle(x, x'))$ is a positive definite kernel. Hint: use Problem 2.18.


2.20 (Alignment Kernel •) Let $\langle K, K' \rangle_F := \sum_{ij} K_{ij} K'_{ij}$ be the Frobenius dot product
of two matrices. Prove that the empirical alignment of two Gram matrices [124],
$A(K, K') := \langle K, K' \rangle_F / \sqrt{\langle K, K \rangle_F \langle K', K' \rangle_F}$, is a positive definite kernel.


Note that the alignment can be used for model selection, putting $K'_{ij} := y_i y_j$ (cf.
Problem 2.16) and $K_{ij} := \operatorname{sgn}(k(x_i, x_j))$ or $K_{ij} := \operatorname{sgn}(k(x_i, x_j)) - b$ (cf. [124]).


2.21 (Equivalence Relations as Kernels •••) Consider a similarity measure
$k : \mathcal{X} \times \mathcal{X} \to \{0, 1\}$ with

$$k(x, x) = 1 \quad \text{for all } x \in \mathcal{X}.$$ (2.96)

Prove that $k$ is a positive definite kernel if and only if, for all $x, x', x'' \in \mathcal{X}$,

$$k(x, x') = 1 \iff k(x', x) = 1 \quad \text{and}$$ (2.97)

$$k(x, x') = k(x', x'') = 1 \implies k(x, x'') = 1.$$ (2.98)

Equations (2.96) to (2.98) amount to saying that $k = I_T$, where $T \subset \mathcal{X} \times \mathcal{X}$ is an
equivalence relation.

As a simple example, consider an undirected graph, and let $(x, x') \in T$ whenever $x$ and $x'$
are in the same connected component of the graph. Show that $T$ is an equivalence relation.

Find examples of equivalence relations that lend themselves to an interpretation as
similarity measures. Discuss whether there are other relations that one might want to use
as similarity measures.

2.22 (Different Feature Spaces for the Same Kernel •) Give an example of a kernel
with two valid feature maps $\Phi_1, \Phi_2$, mapping into spaces $\mathcal{H}_1, \mathcal{H}_2$ of different dimensions.


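2.23 (Converse of Mercer's Theorem •) Prove that if an integral operator kernel $k$
admits a uniformly convergent dot product representation on some compact set $\mathcal{X} \times \mathcal{X}$,

$$k(x, x') = \sum_{i=1}^{\infty} \psi_i(x) \psi_i(x'),$$ (2.99)

then it is positive definite. Hint: show that

$$\int_{\mathcal{X}} \int_{\mathcal{X}} \left(\sum_{i=1}^{\infty} \psi_i(x) \psi_i(x')\right) f(x) f(x') \, dx \, dx' = \sum_{i=1}^{\infty} \left(\int_{\mathcal{X}} \psi_i(x) f(x) \, dx\right)^2 \ge 0.$$

Argue that in particular, polynomial kernels (2.67) satisfy Mercer's conditions.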


2.24 ($\infty$-Norm of Mercer Eigenfunctions ••) Prove that under the conditions of
Theorem 2.10, we have, up to sets of measure zero,

$$\sup_j \|\sqrt{\lambda_j}\, \psi_j\|_\infty \le \sqrt{\|k\|_\infty} < \infty.$$ (2.100)

Hint: note that $\|k\|_\infty \ge k(x, x)$ up to sets of measure zero, and use the series expansion
given in Theorem 2.10. Show, moreover, that it is not generally the case that

$$\sup_j \|\psi_j\|_\infty < \infty.$$ (2.101)

Hint: consider the case where $\mathcal{X} = \mathbb{N}$, $\mu(\{n\}) := 2^{-n}$, and $k(i, j) := \delta_{ij}$. Show that
1. $T_k((a_j)) = (a_j 2^{-j})$ for $(a_j) \in L_2(\mathcal{X}, \mu)$,
2. $T_k$ satisfies $\langle (a_j), T_k (a_j) \rangle = \sum_j (a_j 2^{-j})^2 \ge 0$ and is thus positive definite,
3. $\lambda_j = 2^{-j}$ and $\psi_j = 2^{j/2} e_j$ form an orthonormal eigenvector decomposition of $T_k$ (here,
$e_j$ is the $j$th canonical unit vector in $\ell_2$), and
4. $\|\psi_j\|_\infty = 2^{j/2} = \lambda_j^{-1/2}$.

Argue that the last statement shows that (2.101) is wrong and (2.100) is tight.¹²


12. Thanks to S. Smale and I. Steinwart for this exercise.


2.25 (Generalized Feature Maps •••) Via (2.38), Mercer kernels induce compact
(integral) operators. Can you generalize the idea of defining a feature map associated with an
operator to more general bounded positive definite operators $T$? Hint: use the multiplication
operator representation of $T$ [467].


2.26 (Nyström Approximation (cf. [603]) •) Consider the integral operator obtained
by substituting the distribution $P$ underlying the data into (2.38), i.e.,

$$(T_k f)(x') = \int_{\mathcal{X}} k(x, x') f(x) \, dP(x).$$ (2.102)

If the conditions of Mercer's theorem are satisfied, then $k$ can be diagonalized as

$$k(x, x') = \sum_{j=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x) \psi_j(x'),$$ (2.103)

where $\lambda_j$ and $\psi_j$ satisfy the eigenvalue equation

$$\int_{\mathcal{X}} k(x, x') \psi_j(x) \, dP(x) = \lambda_j \psi_j(x')$$ (2.104)

and the orthonormality conditions

$$\int_{\mathcal{X}} \psi_i(x) \psi_j(x) \, dP(x) = \delta_{ij}.$$ (2.105)

Show that by replacing the integral by a summation over an iid sample $X = \{x_1, \ldots, x_m\}$
from $P(x)$, one can recover the kernel PCA eigenvalue problem (Section 1.7). Hint: Start
by evaluating (2.104) for $x' \in X$, to obtain $m$ equations. Next, approximate the integral by
a sum over the points in $X$, replacing $\int_{\mathcal{X}} k(x, x') \psi_j(x) \, dP(x)$ by
$\frac{1}{m} \sum_{n=1}^{m} k(x_n, x') \psi_j(x_n)$.
Derive the orthogonality condition for the eigenvectors $(\psi_j(x_n))_{n=1,\ldots,m}$ from (2.105).


2.27 (Lorentzian Feature Spaces ••) If a finite number of eigenvalues is negative, the
expansion in Theorem 2.10 is still valid. Show that in this case, $k$ corresponds to a
Lorentzian symmetric bilinear form in a space with indefinite signature [467].

Discuss whether this causes problems for learning algorithms utilizing these kernels. In 
particular, consider the cases of SV machines (Chapter 7) and Kernel PCA (Chapter 14). 


2.28 (Symmetry of Reproducing Kernels •) Show that reproducing kernels (Definition 2.9)
are symmetric. Hint: use (2.35) and exploit the symmetry of the dot product.


2.29 (Coordinate Representation in the RKHS ••) Write $\langle \cdot, \cdot \rangle$ as a dot product of
coordinate vectors by expressing the functions of the RKHS in the basis
$(\sqrt{\lambda_n} \psi_n)_{n=1,\ldots,N_{\mathcal{H}}}$, which is orthonormal with respect to $\langle \cdot, \cdot \rangle$, i.e.,

$$f(x) = \sum_{n=1}^{N_{\mathcal{H}}} \alpha_n \sqrt{\lambda_n}\, \psi_n(x).$$ (2.106)

Obtain an expression for the coordinates $\alpha_n$, using (2.47) and
$\alpha_n = \langle f, \sqrt{\lambda_n} \psi_n \rangle$. Show that $\mathcal{H}$ has the structure of a RKHS in the sense
that for $f$ and $g$ given by (2.106), and

$$g(x) = \sum_{n=1}^{N_{\mathcal{H}}} \beta_n \sqrt{\lambda_n}\, \psi_n(x),$$ (2.107)

we have $\langle \alpha, \beta \rangle = \langle f, g \rangle$. Show, moreover, that $f(x) = \langle \alpha, \Phi(x) \rangle$ in $\mathcal{H}$. In other words,
$\Phi(x)$ is the coordinate representation of the kernel as a function of one argument.


2.30 (Equivalence of Regularization Terms •) Using (2.36) and (2.41), prove that
$\|w\|^2$, where $w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$, is the same no matter whether $\Phi$ denotes the RKHS
feature map (2.21) or the Mercer feature map (2.40).


2.31 (Approximate Inversion of Gram Matrices ••) Use the kernel PCA map (2.59)
to derive a method for approximately inverting a large Gram matrix.


2.32 (Effective Dimension of Feature Space •) Building on Section 2.2.7, argue that
for a finite data set, we are always effectively working in a finite-dimensional feature space.


2.33 (Translation of a Dot Product •) Prove (2.79).


2.34 (Example of a CPD Kernel ••) Argue that the hyperbolic tangent kernel (2.69) is
effectively conditionally positive definite, if the input values are suitably restricted, since
it can be approximated by $k + b$, where $k$ is a polynomial kernel (2.67) and $b \in \mathbb{R}$. Discuss
how this explains that hyperbolic tangent kernels can be used for SVMs although, as
pointed out in a number of works (e.g., [86], cf. the remark following (2.69)), they are not
positive definite.


2.35 (Polarization Identity ••) Prove the polarization identity, stating that for any
symmetric bilinear form $\langle \cdot, \cdot \rangle : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we have, for all $x, x' \in \mathcal{X}$,

$$\langle x, x' \rangle = \frac{1}{4}\left(\langle x + x', x + x' \rangle - \langle x - x', x - x' \rangle\right).$$ (2.108)

Now consider the special case where $\langle \cdot, \cdot \rangle$ is a Euclidean dot product and
$\langle x - x', x - x' \rangle$ is the squared Euclidean distance between $x$ and $x'$. Discuss why the
polarization identity does not imply that the value of the dot product can be recovered from
the distances alone. What else does one need?


2.36 (Vector Space Representation of CPD Kernels •••) Specialize the vector space
representation of symmetric kernels (Proposition 2.25) to the case of cpd kernels. Can you
identify a subspace on which a cpd kernel is actually pd?


2.37 (Parzen Windows Classifiers in Feature Space ••) Assume that $k$ is a positive
definite kernel. Compare the algorithm described in Section 1.2 with the one of (2.89).
Construct situations where the two algorithms give different results. Hint: consider datasets
where the class means coincide.


2.38 (Canonical Distortion Kernel ○○○) Can you define a kernel based on Baxter's
canonical distortion metric [28]?


Overview 


Risk and Loss Functions 


One of the most immediate requirements in any learning problem is to specify 
what exactly we would like to achieve, minimize, bound, or approximate. In other 
words, we need to determine a criterion according to which we will assess the 
quality of an estimate f : X — Y obtained from data. 

This question is far from trivial. Even in binary classification there exist ample 
choices. The selection criterion may be the fraction of patterns classified correctly, 
it could involve the confidence with which the classification is carried out, or it 
might take into account the fact that losses are not symmetric for the two classes, 
such as in health diagnosis problems. Furthermore, the loss for an error may be 
input-dependent (for instance, meteorological predictions may require a higher ac- 
curacy in urban regions), and finally, we might want to obtain probabilities rather 
than a binary prediction of the class labels —1 and 1. Multi class discrimination and 
regression add even further levels of complexity to the problem. Thus we need a 
means of encoding these criteria. 

The chapter is structured as follows: in Section 3.1, we begin with a brief 
overview of common loss functions used in classification and regression algo- 
rithms. This is done without much mathematical rigor or statistical justification, 
in order to provide basic working knowledge for readers who want to get a quick 
idea of the default design choices in the area of kernel machines. Following this, 
Section 3.2 formalizes the idea of risk. The risk approach is the predominant tech- 
nique used in this book, and most of the algorithms presented subsequently mini- 
mize some form of a risk functional. Section 3.3 treats the concept of loss functions 
from a statistical perspective, points out the connection to the estimation of den- 
sities and introduces the notion of efficiency. Readers interested in more detail 
should also consider Chapter 16, which discusses the problem of estimation from 
a Bayesian perspective. The later parts of this section are intended for readers in- 
terested in the more theoretical details of estimation. The concept of robustness is 
introduced in Section 3.4. Several commonly used loss functions, such as Huber's
loss and the $\varepsilon$-insensitive loss, enjoy robustness properties with respect to rather
general classes of distributions. Beyond the basic relations, we will show how to
adjust the $\varepsilon$-insensitive loss in such a way as to accommodate different amounts of
variance automatically. This will later lead to the construction of so-called
$\nu$-Support Vector Algorithms (see Chapters 7, 8, and 9).

While technical details and proofs can be omitted for most of the present chapter,
we encourage the reader to review the practical implications of this section.

[Figure: chapter roadmap showing 3.1.1 Classification, 3.1.2 Regression, 3.2 Empirical
Risk Functional, 3.3.1 Maximum Likelihood, 3.3.2 Efficiency, 3.4 Robustness,
3.4.2 $\varepsilon$-insensitive loss, and 3.4.3, 3.4.4 Adaptive Loss Functions and $\nu$.]

As usual, exercises for all sections can be found at the end. The chapter requires 
knowledge of probability theory, as introduced in Section B.1. 


3.1 Loss Functions 



Let us begin with a formal definition of what we mean by the loss incurred by a 
function f at location x, given an observation y. 


Definition 3.1 (Loss Function) Denote by $(x, y, f(x)) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$ the triplet
consisting of a pattern $x$, an observation $y$ and a prediction $f(x)$. Then the map
$c : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ with the property $c(x, y, y) = 0$ for all $x \in \mathcal{X}$ and
$y \in \mathcal{Y}$ will be called a loss function.


Note that we require c to be a nonnegative function. This means that we will never 
get a payoff from an extra good prediction. If the latter was the case, we could 
always recover non-negativity (provided the loss is bounded from below), by 
using a simple shift operation (possibly depending on x). Likewise we can always 
satisfy the condition that exact predictions (f(x) = y) never cause any loss. The 
advantage of these extra conditions on c is that we know that the minimum of the 
loss is 0 and that it is obtainable, at least for a given x, y. 

Next we will formalize different kinds of loss, as described informally in the 
introduction of the chapter. Note that the incurred loss is not always the quantity 
that we will attempt to minimize. For instance, for algorithmic reasons, some loss 
functions will prove to be infeasible (the binary loss, for instance, can lead to NP- 
hard optimization problems [367]). Furthermore, statistical considerations such as 
the desire to obtain confidence levels on the prediction (Section 3.3.1) will also 
influence our choice. 


3.1.1 Binary Classification 


The simplest case to consider involves counting the misclassification error: if
pattern $x$ is classified wrongly we incur loss 1, otherwise there is no penalty:

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x), \\ 1 & \text{otherwise.} \end{cases}$$ (3.1)


This definition of $c$ does not distinguish between different classes and types of
errors (false positive or negative).¹

A slight extension takes the latter into account. For the sake of simplicity let us
assume, as in (3.1), that we have a binary classification problem. This time, however,
the loss may depend on a function $\tilde{c}(x)$ which accounts for input-dependence,
i.e.

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x), \\ \tilde{c}(x) & \text{otherwise.} \end{cases}$$ (3.2)

A simple (albeit slightly contrived) example is the classification of objects into 
rocks and diamonds. Clearly, the incurred loss will depend largely on the weight 
of the object under consideration. 
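
The two losses (3.1) and (3.2) translate directly into code. The sketch below is a minimal illustration; the function names, the linear cost in the object's weight, and the factor 10 are hypothetical choices made only to mirror the rocks-and-diamonds example.

```python
def zero_one_loss(y, prediction):
    """Misclassification loss (3.1): 0 for a correct label, 1 otherwise."""
    return 0.0 if prediction == y else 1.0


def input_dependent_loss(x, y, prediction, cost):
    """Loss (3.2): a wrong prediction costs cost(x), a correct one nothing."""
    return 0.0 if prediction == y else cost(x)


# Hypothetical rocks-vs-diamonds setting: x is the weight of the object and
# the penalty for a misclassification grows linearly with it.
def weight_cost(x):
    return 10.0 * x


losses = [
    zero_one_loss(+1, +1),                            # correct
    zero_one_loss(+1, -1),                            # wrong
    input_dependent_loss(0.5, +1, -1, weight_cost),   # wrong, light object
    input_dependent_loss(3.0, +1, -1, weight_cost),   # wrong, heavy object
    input_dependent_loss(3.0, -1, -1, weight_cost),   # correct
]
```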

Analogously, we might distinguish between errors for y = 1 and y = —1 (see, 
e.g., [331] for details). For instance, in a fraud detection application, we would like 
to be really sure about the situation before taking any measures, rather than losing 
potential customers. On the other hand, a blood bank should consider even the 
slightest suspicion of disease before accepting a donor. 
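
The class-dependent case can be sketched in a few lines. The function name and the two cost values below are illustrative assumptions, not taken from the text; only the structure (zero loss for a correct prediction, a penalty that depends on the type of error otherwise) follows the discussion above.

```python
def asymmetric_loss(y, prediction, cost_false_positive=1.0, cost_false_negative=5.0):
    """Binary loss with class-dependent penalties (labels are +1 or -1).

    The two cost values are hypothetical; they would be set by the application.
    """
    if prediction == y:
        return 0.0
    # prediction == +1 while y == -1 is a false positive; the reverse is a false negative.
    return cost_false_positive if prediction == 1 else cost_false_negative


# An application that fears missed positives sets a large false-negative cost,
# one that fears false alarms sets a large false-positive cost.
print(asymmetric_loss(y=1, prediction=-1))   # false negative
print(asymmetric_loss(y=-1, prediction=1))   # false positive
```

Which of the two costs should dominate is a modelling decision; the sketch only shows where such a decision enters the loss.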

Rather than predicting only whether a given object x belongs to a certain class
y, we may also want to take a certain confidence level into account. In this case,
f(x) becomes a real-valued function, even though y ∈ {-1, 1}.

Then sgn(f(x)) denotes the class label, and the absolute value |f(x)| the
confidence of the prediction. Corresponding loss functions will depend on the
product y f(x) to assess the quality of the estimate. The soft margin loss function, as
introduced by Bennett and Mangasarian [40, 111], is defined as


c(x, y, f(x)) = \max(0, 1 - y f(x)) = \begin{cases} 0 & \text{if } y f(x) \geq 1, \\ 1 - y f(x) & \text{otherwise.} \end{cases} \qquad (3.3)

In some cases [348, 125] (see also Section 10.6.2) the squared version of (3.3)
provides an expression that can be minimized more easily:


c(x, y, f(x)) = \max(0, 1 - y f(x))^2. \qquad (3.4)


The soft margin loss closely resembles the so-called logistic loss function (cf.
[251], as well as Problem 3.1 and Section 16.1.1):


c(x, y, f(x)) = \ln(1 + \exp(-y f(x))). \qquad (3.5)


We will derive this loss function in Section 3.3.1. It is used in order to associate a 
probabilistic meaning with f(x). 

Note that in both (3.3) and (3.5) (nearly) no penalty occurs if y f(x) is sufficiently 
large, i.e. if the patterns are classified correctly with large confidence. In particular, 
in (3.3) a minimum confidence of 1 is required for zero loss. These loss functions 


1. A false positive is a point which the classifier erroneously assigns to class 1, a false negative
is erroneously assigned to class -1.
[Figure 3.1: four panels titled 0-1 Loss, Linear Soft Margin, Logistic Regression, and Quadratic Soft Margin, each plotted against y f(x).]
Figure 3.1 From left to right: 0-1 loss, linear soft margin loss, logistic regression, and 
quadratic soft margin loss. Note that both soft margin loss functions are upper bounds 
on the 0-1 loss. 


led to the development of large margin classifiers (see [491, 460, 504] and Chapter 5
for further details). Figure 3.1 depicts various popular loss functions.²
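
The margin-based losses (3.3) to (3.5) can be written as functions of the single quantity y f(x). The following is a minimal sketch; the function names are ours, and treating a margin of exactly zero as an error in the 0-1 loss is an assumption made for the illustration.

```python
import math


def zero_one_loss(margin):
    """0-1 loss as a function of the margin y*f(x); a zero margin counts as an error here."""
    return 0.0 if margin > 0 else 1.0


def soft_margin_loss(margin):
    """Linear soft margin (hinge) loss, equation (3.3)."""
    return max(0.0, 1.0 - margin)


def squared_soft_margin_loss(margin):
    """Quadratic soft margin loss, equation (3.4)."""
    return max(0.0, 1.0 - margin) ** 2


def logistic_loss(margin):
    """Logistic loss, equation (3.5)."""
    return math.log(1.0 + math.exp(-margin))


for margin in (-2.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(margin, zero_one_loss(margin), soft_margin_loss(margin),
          squared_soft_margin_loss(margin), round(logistic_loss(margin), 4))
```

On these sample margins both soft margin losses are at least as large as the 0-1 loss, and the logistic loss is small but never exactly zero for large margins.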

Matters are more complex when dealing with more than two classes. Each 
type of misclassification could potentially incur a different loss, leading to an 
M x M matrix (M being the number of classes) with positive off-diagonal and zero 
diagonal entries. It is still a matter of ongoing research in which way a confidence 
level should be included in such cases (cf. [41, 311, 593, 161, 119]). 


3.1.2 Regression 


When estimating real-valued quantities, it is usually the size of the difference
y - f(x), i.e. the amount of misprediction, rather than the product y f(x), which
is used to determine the quality of the estimate. For instance, this can be the actual
loss incurred by mispredictions (e.g., the loss incurred by mispredicting the value
of a financial instrument at the stock exchange), provided the latter is known and
computationally tractable.³ Assuming location independence, in most cases the
loss function will be of the type


c(x, y, f(x)) = \tilde{c}(f(x) - y). \qquad (3.7)


See Figure 3.2 below for several regression loss functions. Below we list the ones 
most common in kernel methods. 


2. Other popular loss functions from the generalized linear model context include the 
inverse complementary log-log function. It is given by 


c(x, y, f(x)) = 1 - \exp(-\exp(y f(x))). \qquad (3.6)


This function, unfortunately, is not convex and therefore it will not lead to a convex opti- 
mization problem. However, it has nice robustness properties and therefore we think that 
it should be investigated in the present context. 

3. As with classification, computational tractability is one of the primary concerns. This is 
not always satisfying from a statistician’s point of view, yet it is crucial for any practical 
implementation of an estimation algorithm. 



The popular choice is to minimize the sum of squares of the residuals f(x) - y.
As we shall see in Section 3.3.1, this corresponds to the assumption that we have
additive normal noise corrupting the observations y_i. Consequently we minimize


c(x, y, f(x)) = (f(x) - y)^2 \quad \text{or equivalently} \quad \tilde{c}(\xi) = \xi^2. \qquad (3.8)


For convenience of subsequent notation, ½ξ² rather than ξ² is often used.
An extension of the soft margin loss (3.3) to regression is the ε-insensitive loss
function [561, 572, 562]. It is obtained by symmetrization of the “hinge” of (3.3),


\tilde{c}(\xi) = \max(|\xi| - \varepsilon, 0) =: |\xi|_\varepsilon. \qquad (3.9)


The idea behind (3.9) is that deviations up to ε should not be penalized, and all
further deviations should incur only a linear penalty. Setting ε = 0 leads to an ℓ1
loss, i.e., to minimization of the sum of absolute deviations. This is written


\tilde{c}(\xi) = |\xi|. \qquad (3.10)


We will study these functions in more detail in Section 3.4.2. 
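
The regression losses (3.8) to (3.10) act on the residual ξ = f(x) - y alone. A minimal sketch follows; the function names and the sample residuals are illustrative assumptions.

```python
def squared_loss(xi):
    """Squared loss (3.8) on the residual xi = f(x) - y."""
    return xi ** 2


def eps_insensitive_loss(xi, eps):
    """epsilon-insensitive loss (3.9): residuals up to eps cost nothing."""
    return max(abs(xi) - eps, 0.0)


def l1_loss(xi):
    """Absolute (l1) loss (3.10), the eps = 0 case of (3.9)."""
    return abs(xi)


residuals = [-2.0, -0.05, 0.0, 0.08, 1.5]
for xi in residuals:
    print(xi, squared_loss(xi), eps_insensitive_loss(xi, eps=0.1), l1_loss(xi))
```

Residuals inside the tube of half-width eps contribute nothing, which is the property that later allows a solution to disregard some training points.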

For efficient implementations of learning procedures, it is crucial that loss func- 
tions satisfy certain properties. In particular, they should be cheap to compute, 
have a small number of discontinuities (if any) in the first derivative, and be con- 
vex in order to ensure the uniqueness of the solution (see Chapter 6 and also Prob- 
lem 3.6 for details). Moreover, we may want to obtain solutions that are compu- 
tationally efficient, which may disregard a certain number of training points. This 
leads to conditions such as vanishing derivatives for a range of function values 
f(x). Finally, requirements such as outlier resistance are also important for the con- 
struction of estimators. 


3.2 Test Error and Expected Risk 


Now that we have determined how errors should be penalized on specific in- 
stances (x, y, f(x)), we have to find a method to combine these (local) penalties. 
This will help us to assess a particular estimate f. 

In the following, we will assume that there exists a probability distribution 
P(x, y) on X × Y which governs the data generation and underlying functional
dependency. Moreover, we denote by P(y|x) the conditional distribution of y given 
x, and by dP(x, y) and dP(y|x) the integrals with respect to the distributions P(x, y) 
and P(y|x) respectively (cf. Section B.1.3). 


3.2.1 Exact Quantities 


Unless stated otherwise, we assume that the data (x, y) are drawn iid (independent
and identically distributed, see Section B.1) from P(x, y). Whether or not we have
knowledge of the test patterns at training time⁴ makes a significant difference in
the design of learning algorithms. If we do, we will want to minimize the
test error on that specific test set; if we do not, the expected error over all possible
test sets.


Definition 3.2 (Test Error) Assume that we are not only given the training data
{x_1, ..., x_m} along with target values {y_1, ..., y_m} but also the test patterns
{x'_1, ..., x'_{m'}} on which we would like to predict y'_i (i = 1, ..., m'). Since we already
know x'_i, all we should care about is to minimize the expected error on the test set. We
formalize this in the following definition


R_{test}[f] := \frac{1}{m'} \sum_{i=1}^{m'} \int_{\mathcal{Y}} c(x'_i, y, f(x'_i)) \, dP(y | x'_i). \qquad (3.11)


Unfortunately, this problem, referred to as transduction, is quite difficult to address, 
both computationally and conceptually, see [562, 267, 37, 211]. Instead, one typi- 
cally considers the case where no knowledge about test patterns is available, as 
described in the following definition. 


Definition 3.3 (Expected Risk) If we have no knowledge about the test patterns (or 
decide to ignore them) we should minimize the expected error over all possible training 
patterns. Hence we have to minimize the expected loss with respect to P and c 


R[f] := E[R_{test}[f]] = E[c(x, y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} c(x, y, f(x)) \, dP(x, y). \qquad (3.12)


Here the integration is carried out with respect to the distribution P(x, y). Again, 
just as (3.11), this problem is intractable, since we do not know P(x, y) explicitly. 
Instead, we are only given the training patterns (x_i, y_i). The latter, however, allow
us to replace the unknown distribution P(x, y) by its empirical estimate. 

To study connections between loss functions and density models, it will be 
convenient to assume that there exists a density p(x, y) corresponding to P(x, y). 
This means that we may replace ∫ dP(x, y) by ∫ p(x, y) dx dy and the appropriate
measure on X × Y. Such a density p(x, y) need not always exist (see Section B.1 for
more details) but we will not give further heed to these concerns at present. 


3.2.2 Approximations 


Unfortunately, this change in notation did not solve the problem. All we have at 
our disposal is the actual training data. What one usually does is replace p(x, y) by 
the empirical density 


p_{emp}(x, y) := \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x) \, \delta_{y_i}(y). \qquad (3.13)

4. The test outputs, however, are not available during training. 
Here δ_{x'}(x) denotes the δ-distribution, satisfying ∫ δ_{x'}(x) f(x) dx = f(x'). The hope
is that replacing p by p_emp will lead to a quantity that is “reasonably close” to the
expected risk. This will be the case if the class of possible solutions f is sufficiently
limited [568, 571]. The issue of closeness with regard to different estimators will be
discussed in further detail in Chapters 5 and 12. Substituting p_emp(x, y) into (3.12)
leads to the empirical risk:


Definition 3.4 (Empirical Risk) The empirical risk is defined as 


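R_{emp}[f] := \int_{\mathcal{X} \times \mathcal{Y}} c(x, y, f(x)) \, p_{emp}(x, y) \, dx \, dy = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)). \qquad (3.14)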


This quantity has the advantage that, given the training data, we can readily 
compute and also minimize it. This constitutes a particular case of what is called an 
M-estimator in statistics. Estimators of this type are studied in detail in the field of 
empirical processes [554]. As pointed out in Section 3.1, it is crucial to understand 
that although our particular M-estimator is built from minimizing a loss, this need 
not always be the case. From a decision-theoretic point of view, the question of 
which loss to choose is a separate issue, which is dictated by the problem at hand 
as well as the goal of trying to evaluate the performance of estimation methods, 
rather than by the problem of trying to define a particular estimation method 
[582, 166, 43]. 
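
Computing the empirical risk of Definition 3.4 amounts to averaging the loss over the training sample. The sketch below is a minimal illustration; the toy sample, the candidate function, and all names are hypothetical.

```python
def empirical_risk(loss, f, xs, ys):
    """Average loss of f over the training sample, as in (3.14)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(loss(x, y, f(x)) for x, y in zip(xs, ys)) / len(xs)


def squared_loss(x, y, fx):
    """c(x, y, f(x)) = (f(x) - y)^2; x is unused because this loss is location independent."""
    return (fx - y) ** 2


# Hypothetical toy sample and candidate function, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
f = lambda x: x
print(empirical_risk(squared_loss, f, xs, ys))
```

Passing the loss as an argument keeps the choice of c separate from the averaging step, in line with the remark that the choice of loss is a separate issue.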

These considerations aside, it may appear as if (3.14) is the answer to our
problems, and all that remains to be done is to find a suitable class of functions ℱ ∋ f
such that we can minimize R_emp[f] with respect to ℱ. Unfortunately, determining
ℱ is quite difficult (see Chapters 5 and 12 for details). Moreover, the minimization
of R_emp[f] can lead to an ill-posed problem [538, 370]. We will show this with a
simple example.

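Assume that we want to solve a regression problem using the quadratic loss
function (3.8) given by c(x, y, f(x)) = (y - f(x))². Moreover, assume that we are
dealing with a linear class of functions,⁵ say


\mathcal{F} := \left\{ f \,\middle|\, f(x) = \sum_{i=1}^{n} \alpha_i f_i(x) \text{ with } \alpha_i \in \mathbb{R} \right\}, \qquad (3.15)


where the f_i are functions mapping X to ℝ.
We want to find the minimizer of R_emp, i.e.,


\operatorname*{minimize}_{f \in \mathcal{F}} R_{emp}[f] = \operatorname*{minimize}_{\alpha \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \sum_{j=1}^{n} \alpha_j f_j(x_i) \right)^2. \qquad (3.16)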


5. In the simplest case, assuming X is contained in a vector space, these could be functions 
that extract coordinates of x; in other words, F would be the class of linear functions on X. 
Computing the derivative of R_emp[f] with respect to α and defining F_ij := f_j(x_i),
we can see that the minimum of (3.16) is achieved if


F^\top y = F^\top F \alpha. \qquad (3.17)


A sufficient condition for (3.17) is α = (FᵀF)⁻¹Fᵀy, where (FᵀF)⁻¹ denotes the
(pseudo-)inverse of the matrix.

If FᵀF has a bad condition number (i.e. the quotient between the largest and the
smallest eigenvalue of FᵀF is large), it is numerically difficult [423, 530] to solve
(3.17) for α. Furthermore, if n > m, i.e. if we have more basis functions f_i than
training patterns x_i, there will exist a subspace of solutions with dimension at least
n - m, satisfying (3.17). This is undesirable both practically (speed of computation)
and theoretically (we would have to deal with a whole class of solutions rather
than a single one).

One might also expect that if ℱ is too rich, the discrepancy between R_emp[f] and
R[f] could be large. For instance, if F is an m × m matrix of full rank, ℱ contains
an f that predicts all target values y_i correctly on the training data. Nevertheless,
we cannot expect that we will also obtain zero prediction error on unseen points.
Chapter 4 will show how these problems can be overcome by adding a so-called
regularization term to R_emp[f].
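
The case n = m can be made concrete with a small sketch that solves the normal equations (3.17) directly. Everything specific here is an assumption for illustration: the three data points, the monomial basis, and the plain Gaussian elimination routine, which stands in for whatever linear solver one would use in practice.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a is n x n)."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_least_squares(basis, xs, ys):
    """Solve the normal equations F^T F alpha = F^T y with F_ij = f_j(x_i)."""
    F = [[f_j(x_i) for f_j in basis] for x_i in xs]
    n = len(basis)
    FtF = [[sum(row[i] * row[j] for row in F) for j in range(n)] for i in range(n)]
    Fty = [sum(row[i] * y for row, y in zip(F, ys)) for i in range(n)]
    return solve_linear_system(FtF, Fty)


# Hypothetical data: three points and the monomial basis 1, x, x^2 (n = m = 3).
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs, ys = [0.0, 1.0, 2.0], [1.0, 0.0, 3.0]
alpha = fit_least_squares(basis, xs, ys)
predict = lambda x: sum(a * f_j(x) for a, f_j in zip(alpha, basis))
print([round(predict(x), 6) for x in xs])  # reproduces ys up to rounding
```

The fit has zero empirical risk on the three training points, yet nothing in the construction controls its value at other inputs. With n > m the matrix FᵀF would be singular and the solver above would raise an error, which mirrors the non-uniqueness discussed in the text.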


3.3 A Statistical Perspective 


Given a particular pattern x̃, we may want to ask what risk we can expect for it,
and with which probability the corresponding loss is going to occur. In other words,
instead of (or in addition to) E[c(x̃, y, f(x̃))] for a fixed x̃, we may want to know the
distribution of y given x̃, i.e., P(y|x̃).

(Bayesian) statistics (see [338, 432, 49, 43] and also Chapter 16) often attempt 
to estimate the density corresponding to the random variables (x, y), and in some 
cases, we may really need information about p(x, y) to arrive at the desired conclu- 
sions given the training data (e.g., medical diagnosis). However, we always have 
to keep in mind that if we model the density p first, and subsequently, based on 
this approximation, compute a minimizer of the expected risk, we will have to 
make two approximations. This could lead to inferior or at least not easily pre- 
dictable results. Therefore, wherever possible, we should avoid solving a more 
general problem, since additional approximation steps might only make the esti- 
mates worse [561]. 


3.3.1 Maximum Likelihood Estimation


All this said, we still may want to compute the conditional density p(y|x). For
this purpose we need to model how y is generated, based on some underlying
dependency f(x); thus, we specify the functional form of p(y|x, f(x)) and maximize
the expression with respect to f. This will provide us with the function f that is
most likely to have generated the data.


Definition 3.5 (Likelihood) The likelihood of a sample (x_1, y_1), ..., (x_m, y_m) given an
underlying functional dependency f is given by


p(\{x_1, \ldots, x_m\}, \{y_1, \ldots, y_m\} | f) = \prod_{i=1}^{m} p(x_i, y_i | f) = \prod_{i=1}^{m} p(y_i | x_i, f) \, p(x_i). \qquad (3.18)


Strictly speaking the likelihood only depends on the values f(x_1), ..., f(x_m) rather
than being a functional of f itself. To keep the notation simple, however, we
write p({x_1, ..., x_m}, {y_1, ..., y_m}|f) instead of the more heavyweight expression
p({x_1, ..., x_m}, {y_1, ..., y_m}|{f(x_1), ..., f(x_m)}).

For practical reasons, we convert products into sums by taking the negative
logarithm of p({x_1, ..., x_m}, {y_1, ..., y_m}|f), an expression which is then conveniently
minimized. Furthermore, we may drop the p(x_i) from (3.18), since they do
not depend on f. Thus maximization of (3.18) is equivalent to minimization of the
Log-Likelihood


\mathcal{L}[f] := \sum_{i=1}^{m} -\ln p(y_i | x_i, f). \qquad (3.19)

Remark 3.6 (Regression Loss Functions) Minimization of ℒ[f] and of R_emp[f] coincide
if the loss function c is chosen according to


c(x, y, f(x)) = -\ln p(y | x, f). \qquad (3.20)


Assuming that the target values y were generated by an underlying functional dependency
f plus additive noise ξ with density p_ξ, i.e. y_i = f_true(x_i) + ξ_i, we obtain


c(x, y, f(x)) = -\ln p_\xi(y - f(x)). \qquad (3.21)
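
As a numerical check of (3.21), the sketch below derives a loss from a noise density and compares it with the squared loss for unit-variance Gaussian noise. The helper names are illustrative assumptions.

```python
import math


def gaussian_density(xi, sigma=1.0):
    """Density of additive normal noise with zero mean and standard deviation sigma."""
    return math.exp(-xi ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def loss_from_density(density, y, fx):
    """Loss induced by a noise density via (3.21): c = -ln p_xi(y - f(x))."""
    return -math.log(density(y - fx))


# For unit-variance Gaussian noise the induced loss is 0.5 * (y - f(x))^2 plus a constant
# that does not depend on f and therefore does not change the minimizer.
constant = 0.5 * math.log(2.0 * math.pi)
for y, fx in [(1.0, 1.0), (1.0, 0.0), (2.0, -1.0)]:
    induced = loss_from_density(gaussian_density, y, fx)
    print(induced, 0.5 * (y - fx) ** 2 + constant)
```

The two printed columns agree, which is the sense in which squared loss corresponds to the assumption of additive normal noise.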


Things are slightly different in classification. Since all we are interested in is the 
probability that pattern x has label 1 or -1 (assuming binary classification), we
can transform the problem into one of estimating the logarithm of the probability 
that a pattern assumes its correct label. 


Remark 3.7 (Classification Loss Functions) We have a finite set of labels, which allows
us to model P(y|f(x)) directly, instead of modelling a density. In the binary
classification case (classes 1 and -1) this problem becomes particularly easy, since all we
have to do is assume a functional dependency underlying P(1|f(x)): this immediately gives us
P(-1|f(x)) = 1 - P(1|f(x)). The link to loss functions is established via


c(x, y, f(x)) = -\ln P(y | f(x)). \qquad (3.22)


The same result can be obtained by minimizing the cross entropy⁶ between the classification
labels y_i and the probabilities p(y|f(x)), as is typically done in a generalized linear
models context (see e.g., [355, 232, 163]). For binary classification (with y ∈ {±1}) we
obtain


c(x, y, f(x)) = -\frac{1 + y}{2} \ln P(y = 1 | f(x)) - \frac{1 - y}{2} \ln P(y = -1 | f(x)). \qquad (3.23)


When substituting the actual values for y into (3.23), this reduces to (3.22).


6. In the case of discrete variables the cross entropy between two distributions P and Q is
defined as -Σ_i P(i) ln Q(i).


Table 3.1 Common loss functions and corresponding density models according to Remark 3.6.
As a shorthand we use c̃(f(x) - y) := c(x, y, f(x)).

                       loss function c̃(ξ)                    density model p(ξ)
ε-insensitive          |ξ|_ε                                  (1/(2(1+ε))) exp(-|ξ|_ε)
Laplacian              |ξ|                                    (1/2) exp(-|ξ|)
Gaussian               (1/2) ξ²                               (1/√(2π)) exp(-ξ²/2)
Huber's robust loss    ξ²/(2σ) if |ξ| ≤ σ                     ∝ exp(-ξ²/(2σ)) if |ξ| ≤ σ
                       |ξ| - σ/2 otherwise                    ∝ exp(σ/2 - |ξ|) otherwise
Polynomial             (1/p) |ξ|^p                            ∝ exp(-|ξ|^p / p)
Piecewise polynomial   |ξ|^p / (p σ^(p-1)) if |ξ| ≤ σ         ∝ exp(-|ξ|^p / (p σ^(p-1))) if |ξ| ≤ σ
                       |ξ| - σ(p-1)/p otherwise               ∝ exp(σ(p-1)/p - |ξ|) otherwise


At this point we have a choice in modelling P(y = 1|f(x)) to suit our needs. 
Possible models include the logistic transfer function, the probit model, the inverse 
complementary log-log model. See Section 16.3.5 for a more detailed discussion of 
the choice of such link functions. Below we explain connections in some more detail 
for the logistic link function. 

For a logistic model, where P(y = ±1|x, f) ∝ exp(±½ f(x)), we obtain after normalization


P(y = 1 | x, f) = \frac{\exp(f(x))}{1 + \exp(f(x))} \qquad (3.24)


and consequently -ln P(y = 1|x, f) = ln(1 + exp(-f(x))). We thus recover (3.5) as
the loss function for classification. Choices other than (3.24) for a map ℝ → [0, 1]
will lead to further loss functions for classification. See [579, 179, 596] and Section
16.1.1 for more details on this subject.
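
The step from the unnormalized logistic model to the loss (3.5) can be checked numerically. The sketch below normalizes exp(±f(x)/2) over the two labels and compares the negative log-probability with the logistic loss; the function names are our own.

```python
import math


def logistic_probability(label, fx):
    """P(y = label | x, f) for label in {+1, -1}, obtained by normalizing exp(+-f(x)/2)."""
    weight_pos = math.exp(0.5 * fx)
    weight_neg = math.exp(-0.5 * fx)
    weight = weight_pos if label == 1 else weight_neg
    return weight / (weight_pos + weight_neg)


def logistic_loss(label, fx):
    """Logistic loss (3.5): ln(1 + exp(-y f(x)))."""
    return math.log(1.0 + math.exp(-label * fx))


for fx in (-3.0, 0.0, 1.5):
    for label in (1, -1):
        print(label, fx, -math.log(logistic_probability(label, fx)), logistic_loss(label, fx))
```

Both columns coincide for either label, so (3.24) and (3.5) describe the same model.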

It is important to note that not every loss function used in classification corre- 
sponds to such a density model (recall that in this case, the probabilities have to 
add up to 1 for any value of f(x)). In fact, one of the most popular loss functions, 
the soft margin loss (3.3), does not enjoy this property. A discussion of these issues 
can be found in [521]. 
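
One way to see this numerically is to interpret exp(-c) as a probability and sum it over both labels. For a loss of the form (3.22) the sum equals 1 for every value of f(x). The sketch below is an illustration under that reading, with hypothetical function names.

```python
import math


def implied_total_probability(loss, fx):
    """Sum of exp(-c) over both labels; equals 1 if c = -ln P(y | f(x)) for a valid model."""
    return math.exp(-loss(1, fx)) + math.exp(-loss(-1, fx))


def soft_margin_loss(label, fx):
    """Soft margin loss (3.3) written in terms of the label and f(x)."""
    return max(0.0, 1.0 - label * fx)


def logistic_loss(label, fx):
    """Logistic loss (3.5) written in terms of the label and f(x)."""
    return math.log(1.0 + math.exp(-label * fx))


for fx in (0.0, 1.0, 3.0):
    print(fx, implied_total_probability(logistic_loss, fx),
          implied_total_probability(soft_margin_loss, fx))
```

For the logistic loss the total stays at 1, whereas for the soft margin loss it changes with f(x), so no single rescaling turns it into a pair of label probabilities.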

Table 3.1 summarizes common loss functions and the corresponding density 
models as defined by (3.21), some of which were already presented in Section 
3.1. It is an exhaustive list of the loss functions that will be used in this book for 
regression. Figure 3.2 contains graphs of the functions. 
Figure 3.2 Graphs of loss functions and corresponding density models. Upper left: Gaussian,
upper right: Laplacian, lower left: Huber’s robust, lower right: ε-insensitive.


We conclude with a few cautionary remarks. The loss function resulting from
a maximum likelihood reasoning might be non-convex. This might spell trouble
when we try to find an efficient solution of the corresponding minimization problem.
Moreover, we made a very strong assumption by claiming to know P(y|x, f)
explicitly, which was necessary in order to evaluate (3.20).

Finally, the solution we obtain by minimizing the log-likelihood depends on
the class of functions ℱ. So we are in no better situation than by minimizing
R_emp[f], albeit with the additional constraint that the loss functions c(x, y, f(x))
must correspond to a probability density.


3.3.2 Efficiency 


The above reasoning could mislead us into thinking that the choice of loss func- 
tion is rather arbitrary, and that there exists no good means of assessing the per- 
formance of an estimator. In the present section we will develop tools which can 
be used to compare estimators that are derived from different loss functions. For 
this purpose we need to introduce additional statistical concepts which deal with 
the efficiency of an estimator. Roughly speaking, these give an indication of how 
“noisy” an estimator is with respect to a reference estimator.

We begin by formalizing the concept of an estimator. Denote by P(y|θ) a distribution
of y depending (amongst other variables) on the parameters θ, and by
Y = {y_1, ..., y_m} an m-sample drawn iid from P(y|θ). Note that the use of the symbol
y bears no relation to the y_i that are outputs of some functional dependency
(cf. Chapter 1). We employ this symbol because some of the results to be derived
will later be applied to the outputs of SV regression.

Next, we introduce the estimator θ̂(Y) of the parameters θ, based on Y. For
instance, P(y|θ) could be a Gaussian with fixed variance and mean θ, and θ̂(Y)
could be the estimator (1/m) Σ_{i=1}^m y_i.

To avoid cumbersome notation, we use the shorthand


E_\theta[\xi(y)] := E_{P(y|\theta)}[\xi(y)] = \int \xi(y) \, dP(y | \theta), \qquad (3.25)


to express expectations of a random variable ξ(y) with respect to P(y|θ). One
criterion that we might impose on an estimator is that it be unbiased, i.e., that
on average, it tells us the correct value of the parameter it attempts to estimate.


Definition 3.8 (Unbiased Estimator) An unbiased estimator θ̂(Y) of the parameters θ
in P(y|θ) satisfies


E_\theta[\hat{\theta}(Y)] = \theta. \qquad (3.26)
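
Condition (3.26) can be probed by simulation. The sketch below approximates the expectation of the sample mean for Gaussian data by Monte Carlo; the sample size, number of trials, seed, and the shrunken comparison estimator are all assumptions chosen for illustration.

```python
import random


def sample_mean(sample):
    """The estimator (1/m) * sum_i y_i of the mean."""
    return sum(sample) / len(sample)


def average_estimate(estimator, theta, m, trials, seed=0):
    """Monte Carlo approximation of E_theta[estimator(Y)] for Gaussian data with unit variance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(theta, 1.0) for _ in range(m)]
        total += estimator(sample)
    return total / trials


theta = 2.0
print(average_estimate(sample_mean, theta, m=10, trials=20000))
# A deliberately biased (shrunken) estimator for comparison; it is not from the text.
print(average_estimate(lambda s: 0.8 * sample_mean(s), theta, m=10, trials=20000))
```

The first value lies close to θ = 2, while the shrunken estimator settles near 0.8 θ, which shows a bias that no amount of averaging removes.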


In this section, we will focus on unbiased estimators. In general, however, the 
estimators we are dealing with in this book will not be unbiased. In fact, they 
will have a bias towards ‘simple’, low-complexity functions. Properties of such 
estimators are more difficult to deal with, which is why, for the sake of simplicity, 
we restrict ourselves to the unbiased case in this section. Note, however, that 
“biasedness” is not a bad property by itself. On the contrary, there exist cases, such as
the one described by James and Stein [262], where biased estimators consistently
outperform unbiased estimators in the finite sample size setting, both in terms of 
variance and prediction error. 

A possible way to compare unbiased estimators is to compute their variance. 
Other quantities such as moments of higher order or maximum deviation prop- 
erties would be valid criteria as well, yet for historical and practical reasons the 
variance has become a standard tool to benchmark estimators. The Fisher infor- 
mation matrix is crucial for this purpose since it will tell us via the Cramér-Rao 
bound (Theorem 3.11) the minimal possible variance for an unbiased estimator. 
The idea is that the smaller the variance, the lower (typically) the probability that
θ̂(Y) will deviate from θ by a large amount. Therefore, we can use the variance as
a possible one-number summary to compare different estimators.


Definition 3.9 (Score Function, Fisher Information, Covariance) Assume there exists
a density p(y|θ) for the distribution P(y|θ) such that ln p(y|θ) is differentiable with
respect to θ. The score V_θ(Y) of P(y|θ) is a random variable defined by⁷


V_\theta(Y) := \partial_\theta \ln p(Y|\theta) = \partial_\theta \sum_{i=1}^{m} \ln p(y_i|\theta) = \sum_{i=1}^{m} \frac{\partial_\theta p(y_i|\theta)}{p(y_i|\theta)}. \qquad (3.27)


This score tells us how much the likelihood of the data depends on the different components
of θ, and thus, in the maximum likelihood procedure, how much the data affect the choice
of θ. The covariance of V_θ(Y) is called the Fisher information matrix I. It is given by


I_{ij} := E_\theta\left[ \partial_{\theta_i} \ln p(Y|\theta) \cdot \partial_{\theta_j} \ln p(Y|\theta) \right] \qquad (3.28)


and the covariance matrix B of the estimator θ̂(Y) is defined by


B_{ij} := E_\theta\left[ \left( \hat{\theta}_i - E_\theta[\hat{\theta}_i] \right) \left( \hat{\theta}_j - E_\theta[\hat{\theta}_j] \right) \right]. \qquad (3.29)


The covariance matrix B tells us the amount of variation of the estimator. It can
therefore be used (e.g., by Chebychev’s inequality) to bound the probability that
θ̂(Y) deviates from θ by more than a certain amount.


Remark 3.10 (Expected Value of Fisher Score) One can check that the expected value
of V_θ(Y) is 0 since


E_\theta[V_\theta(Y)] = \int p(Y|\theta) \, \partial_\theta \ln p(Y|\theta) \, dY = \partial_\theta \int p(Y|\theta) \, dY = \partial_\theta 1 = 0. \qquad (3.30)


In other words, the contribution of Y to the adjustment of θ averages to 0 over all possible
Y, drawn according to P(Y|θ). Equivalently we could say that the average likelihood for Y
drawn according to P(Y|θ) is extremal, provided we choose θ: the derivative of the expected
likelihood of the data E_θ[ln P(Y|θ)] with respect to θ vanishes. This is also what we expect,
namely that the “proper” distribution is on average the one with the highest likelihood.
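
For a Gaussian with known standard deviation σ and unknown mean θ, the score (3.27) is the sum of (y_i - θ)/σ², and its variance, the Fisher information, is m/σ². The sketch below checks both facts by simulation; the parameter values, seed, and function names are assumptions for illustration.

```python
import random


def gaussian_score(sample, theta, sigma=1.0):
    """Score (3.27) for the mean of a Gaussian: sum_i (y_i - theta) / sigma^2."""
    return sum((y - theta) / sigma ** 2 for y in sample)


def score_moments(theta, sigma, m, trials, seed=0):
    """Monte Carlo estimates of the mean and variance of the score under P(Y | theta)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(m)]
        scores.append(gaussian_score(sample, theta, sigma))
    mean = sum(scores) / trials
    variance = sum((s - mean) ** 2 for s in scores) / (trials - 1)
    return mean, variance


mean, variance = score_moments(theta=1.0, sigma=2.0, m=8, trials=20000)
print(mean)      # close to 0, as stated in Remark 3.10
print(variance)  # close to the Fisher information m / sigma^2 = 2
```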


The following theorem gives a lower bound on the variance of an estimator, i.e.,
a bound on B in terms of the Fisher information I. This is useful to determine how
well a given estimator performs with respect to the one with the lowest possible
variance.


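Theorem 3.11 (Cramér and Rao [425]) Any unbiased estimator θ̂(Y) satisfies


\det IB \geq 1. \qquad (3.31)


Proof We prove (3.31) for the scalar case. The extension to matrices is left as an
exercise (see Problem 3.10). Using the Cauchy-Schwarz inequality, we obtain


\left( E_\theta\left[ (V_\theta(Y) - E_\theta[V_\theta(Y)]) (\hat{\theta}(Y) - E_\theta[\hat{\theta}(Y)]) \right] \right)^2 \qquad (3.32)

\leq E_\theta\left[ (V_\theta(Y) - E_\theta[V_\theta(Y)])^2 \right] E_\theta\left[ (\hat{\theta}(Y) - E_\theta[\hat{\theta}(Y)])^2 \right] = IB. \qquad (3.33)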

7. Recall that ∂_θ p(Y|θ) is the gradient of p(Y|θ) with respect to the parameters θ_1, ..., θ_n.
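At the same time, E_θ[V_θ(Y)] = 0 implies that


\left( E_\theta\left[ (V_\theta(Y) - E_\theta[V_\theta(Y)]) (\hat{\theta}(Y) - E_\theta[\hat{\theta}(Y)]) \right] \right)^2 \qquad (3.34)

= \left( E_\theta\left[ V_\theta(Y) \hat{\theta}(Y) \right] \right)^2 \qquad (3.35)

= \left( \int p(Y|\theta) V_\theta(Y) \hat{\theta}(Y) \, dY \right)^2 = \left( \partial_\theta \int p(Y|\theta) \hat{\theta}(Y) \, dY \right)^2 = (\partial_\theta \theta)^2 = 1, \qquad (3.36)


since we may interchange integration by Y and ∂_θ. ∎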


Eq. (3.31) lends itself to the definition of a one-number summary of the properties 
of an estimator, namely how closely the inequality is met. 
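
How closely an estimator meets (3.31) can be examined by simulation. The sketch below estimates the variance B of two location estimators on Gaussian samples and multiplies it by the Fisher information m/σ² of the Gaussian mean. The choice of estimators, sample size, trial count, and seed are assumptions made for the illustration.

```python
import random
import statistics


def estimator_variance(estimator, theta, sigma, m, trials, seed=0):
    """Monte Carlo estimate of the variance B of a scalar estimator on Gaussian samples."""
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(theta, sigma) for _ in range(m)])
                 for _ in range(trials)]
    return statistics.variance(estimates)


def sample_mean(sample):
    """Arithmetic mean of the sample."""
    return sum(sample) / len(sample)


theta, sigma, m = 0.0, 1.0, 25
fisher_information = m / sigma ** 2  # Fisher information of the Gaussian mean for m samples

for name, estimator in [("mean", sample_mean), ("median", statistics.median)]:
    b = estimator_variance(estimator, theta, sigma, m, trials=10000)
    print(name, "I*B =", fisher_information * b)
```

For the sample mean the product I·B comes out close to 1, so the bound is essentially attained. For the median it is noticeably larger, which indicates a higher variance on Gaussian data.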


Definition 3.12 (Efficiency) The statistical efficiency e of an estimator θ̂(Y) is defined as


e := 1 / \det IB. \qquad (3.37)


The closer e is to 1, the lower the variance of the corresponding estimator θ̂(Y).
For a special class of estimators minimizing loss functions, the following theorem
allows us to compute B and e efficiently.


Theorem 3.13 (Murata, Yoshizawa, Amari [379, Lemma 3]) Assume that θ̂ is defined
by θ̂(Y) := argmin_θ d(Y, θ) and that d is a twice differentiable function in θ.
Then asymptotically, for increasing sample size m → ∞, the variance B is given by
B = Q⁻¹GQ⁻¹. Here


G_{ij} := \operatorname{cov}_\theta\left[ \partial_{\theta_i} d(Y, \theta), \partial_{\theta_j} d(Y, \theta) \right] \quad \text{and} \qquad (3.38)

Q_{ij} := E_\theta\left[ \partial^2_{\theta_i \theta_j} d(Y, \theta) \right] \qquad (3.39)


and therefore e = (det Q)² / (det IG).


This means that for the class of estimators defined via d, the evaluation of their
asymptotic efficiency can be conveniently achieved via (3.38) and (3.39). For scalar
valued estimators θ̂(Y) ∈ ℝ, these expressions can be greatly simplified to


I = \int \left( \partial_\theta \ln p(Y|\theta) \right)^2 dP(Y|\theta), \qquad (3.40)

G = \int \left( \partial_\theta d(Y, \theta) \right)^2 dP(Y|\theta), \qquad (3.41)

Q = \int \partial_\theta^2 d(Y, \theta) \, dP(Y|\theta). \qquad (3.42)


Finally, in the case of continuous densities, Theorem 3.13 may be extended to 
piecewise twice differentiable continuous functions d, by convolving the latter 
with a twice differentiable smoothing kernel, and letting the width of the smooth- 
ing kernel converge to zero. We will make use of this observation in the next sec- 
tion when studying the efficiency of some estimators. 
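
For a location model with a single observation, the integrals (3.40) to (3.42) can be approximated numerically in terms of the residual ξ = y - θ. The sketch below uses the trapezoidal rule on a truncated interval; the integration range, step count, and function names are assumptions, and the clipped-slope loss in the second example is a Huber-type loss chosen for illustration.

```python
import math


def integrate(func, lower, upper, steps=20000):
    """Composite trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    total += sum(func(lower + i * h) for i in range(1, steps))
    return total * h


def asymptotic_efficiency(density, log_density_slope, loss_slope, loss_curvature,
                          lower=-12.0, upper=12.0):
    """e = Q^2 / (I G) from (3.40)-(3.42) for a location model with residual xi = y - theta."""
    fisher = integrate(lambda xi: log_density_slope(xi) ** 2 * density(xi), lower, upper)
    g = integrate(lambda xi: loss_slope(xi) ** 2 * density(xi), lower, upper)
    q = integrate(lambda xi: loss_curvature(xi) * density(xi), lower, upper)
    return q ** 2 / (fisher * g)


gaussian = lambda xi: math.exp(-0.5 * xi ** 2) / math.sqrt(2.0 * math.pi)
gaussian_slope = lambda xi: -xi  # derivative of ln p for the unit Gaussian

# Squared loss d = xi^2 / 2: slope xi, curvature 1.
print(asymptotic_efficiency(gaussian, gaussian_slope, lambda xi: xi, lambda xi: 1.0))

# Huber-type loss with threshold 1: slope clipped to [-1, 1], curvature 1 inside and 0 outside.
clip = lambda xi: max(-1.0, min(1.0, xi))
inside = lambda xi: 1.0 if abs(xi) <= 1.0 else 0.0
print(asymptotic_efficiency(gaussian, gaussian_slope, clip, inside))
```

Derivatives with respect to θ and with respect to ξ differ only in sign, which does not affect the squared slopes or the curvature. On Gaussian data the squared loss reaches an efficiency of about 1, while the clipped loss gives up some efficiency in exchange for bounded slopes.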
The current section concludes with the proof that the maximum likelihood
estimator meets the Cramér-Rao bound.


Theorem 3.14 (Efficiency of Maximum Likelihood [118, 218, 43]) The maximum
likelihood estimator (cf. (3.18) and (3.19)) given by


\hat{\theta}(Y) := \operatorname*{argmax}_{\theta} \ln p(Y | \theta) = \operatorname*{argmin}_{\theta} \mathcal{L}[\theta] \qquad (3.43)


is asymptotically efficient (e = 1).


To keep things simple we will prove (3.43) only for the class of twice differentiable 
continuous densities by applying Theorem 3.13. For a more general proof see 
[118, 218, 43]. 


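Proof By construction, G is equal to the Fisher information matrix, if we choose
d according to (3.43). Hence a sufficient condition is that Q = -I, which is what
we show below. To this end we expand the integrand of (3.42),


\partial_\theta^2 d(Y, \theta) = \partial_\theta^2 \ln p(Y|\theta) = \frac{\partial_\theta^2 p(Y|\theta)}{p(Y|\theta)} - \left( \frac{\partial_\theta p(Y|\theta)}{p(Y|\theta)} \right)^2 = \frac{\partial_\theta^2 p(Y|\theta)}{p(Y|\theta)} - V_\theta^2(Y). \qquad (3.44)


The expectation of the second term in (3.44) equals -I. We now show that the
expectation of the first term vanishes:


\int p(Y|\theta) \frac{\partial_\theta^2 p(Y|\theta)}{p(Y|\theta)} \, dY = \partial_\theta^2 \int p(Y|\theta) \, dY = \partial_\theta^2 1 = 0. \qquad (3.45)


Hence Q = -I and thus e = Q²/(IG) = 1. This proves that the maximum likelihood
estimator is asymptotically efficient. ∎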


It appears as if the best thing we could do is to use the maximum likelihood (ML) 
estimator. Unfortunately, reality is not quite as simple as that. First, the above 
statement holds only asymptotically. This leads to the (justified) suspicion that 
for finite sample sizes we may be able to do better than ML estimation. Second, 
practical considerations such as the additional goal of sparse decomposition may 
lead to the choice of a non-optimal loss function. 

Finally, we may not know the true density model, which is required for the 
definition of the maximum likelihood estimator. We can try to make an educated 
guess; bad guesses of the class of densities, however, can lead to large errors in the 
estimation (see, e.g., [251]). This prompted the development of robust estimators. 


3.4 Robust Estimators 


So far, in order to make any practical predictions, we had to assume a certain 
class of distributions from which P(Y) was chosen. Likewise, in the case of risk 
functionals, we also assumed that training and test data are identically distributed. 
This section provides tools to safeguard ourselves against cases where the above 
assumptions are not satisfied.

More specifically, we would like to avoid a certain fraction ν of ‘bad’ observations
(often also referred to as ‘outliers’) seriously affecting the quality of the
estimate. This implies that the influence of individual patterns should be bounded 
from above. Huber [250] gives a detailed list of desirable properties of a robust 
estimator. We refrain from reproducing this list at present, or committing to a par- 
ticular definition of robustness. 
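
The effect of a single bad observation on two location estimators can be seen in a small example. The data values below are hypothetical and chosen only to make the contrast visible.

```python
import statistics

clean = [9.8, 10.1, 9.9, 10.2, 10.0]
contaminated = clean[:-1] + [1000.0]  # one hypothetical 'bad' observation replaces a good one

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The mean moves by almost 200 units, whereas the median moves by 0.1. The influence of the outlier on the median is bounded, which is the kind of behavior asked for above.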

As is usual in the context of estimating a location parameter (i.e., estimating the
expected value of a random variable), we assume a specific parametric form of
p(Y|θ), namely


p(Y | \theta) = \prod_{i=1}^{m} p(y_i | \theta) = \prod_{i=1}^{m} p(y_i - \theta). \qquad (3.46)


Unless stated otherwise, this is the formulation we will use throughout this sec- 
tion. 


3.4.1 Robustness via Loss Functions 


Huber’s idea [250] in constructing a robust estimator was to take a loss function as
provided by the maximum likelihood framework, and modify it in such a way
as to limit the influence of each individual pattern. This is done by providing
an upper bound on the slope of -ln p(Y|θ). We shall see that methods such
as the trimmed mean or the median are special cases thereof. The ε-insensitive
loss function can also be viewed as a trimmed estimator. This will lead to the
development of adaptive loss functions in the subsequent sections. We begin with
the main theorem of this section.


Theorem 3.15 (Robust Loss Functions (Huber [250])) Let 𝒫 be a class of densities
formed by


\mathcal{P} := \{ p \mid p = (1 - \varepsilon) p_0 + \varepsilon p_1 \} \text{ where } \varepsilon \in (0, 1) \text{ and } p_0 \text{ are known.} \qquad (3.47)


Moreover assume that both p_0 and p_1 are symmetric with respect to the origin, their
logarithms are twice continuously differentiable, ln p_0 is convex and known, and p_1 is
unknown. Then the density


\bar{p}(\theta) = (1 - \varepsilon) \begin{cases} p_0(\theta) & \text{if } |\theta| \leq \theta_0 \\ p_0(\theta_0) e^{-k(|\theta| - \theta_0)} & \text{otherwise} \end{cases} \qquad (3.48)


is robust in the sense that the maximum likelihood estimator corresponding to (3.48) has
minimum variance with respect to the “worst” possible density p_worst = (1 - ε)p_0 + εp_1:
it is a saddle point (located at p_worst) in terms of variance with respect to the true density
p ∈ 𝒫 and the density p̄ ∈ 𝒫 used in estimating the location parameter. This means that
no density p has larger variance than p_worst and that for p = p_worst no estimator is better
than the one where p̄ = p_worst, as used in the robust estimator.

The constants k > 0 and θ_0 are obtained by the normalization condition, that p̄ be a
proper density and that the first derivative in ln p̄ be continuous.


Proof To show that p̄ is a saddle point in 𝔓 we have to prove that (a) no estimation
procedure other than the one using ln p̄ as the loss function has lower variance
for the density p̄, and that (b) no density has higher variance than p̄ if ln p̄ is used
as loss function. Part (a) follows immediately from the Cramér-Rao theorem (Th.
3.11); part (b) can be proved as follows.

We use Theorem 3.13, and a proof technique pointed out in [559], to compute
the variance of an estimator using ln p̄ as loss function;

$$B = \frac{\int \left(\partial_\theta \ln \bar{p}(y|\theta)\right)^2 \left((1-\epsilon)p_0(y|\theta) + \epsilon p'(y|\theta)\right) dy}{\left(\int \partial_\theta^2 \ln \bar{p}(y|\theta) \left((1-\epsilon)p_0(y|\theta) + \epsilon p'(y|\theta)\right) dy\right)^2}. \tag{3.49}$$

Here p′ is an arbitrary density which we will choose such that B is maximized. By
construction,

$$\left(\partial_\theta \ln \bar{p}(y|\theta)\right)^2 = \begin{cases} \left(\partial_\theta \ln p_0(y|\theta)\right)^2 \le k^2 & \text{if } |y - \theta| \le \theta_0, \\ k^2 & \text{otherwise,} \end{cases} \tag{3.50}$$

$$\partial_\theta^2 \ln \bar{p}(y|\theta) = \begin{cases} \partial_\theta^2 \ln p_0(y|\theta) \ge 0 & \text{if } |y - \theta| \le \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.51}$$

Thus any density p′ which is 0 in [−θ₀, θ₀] will minimize the denominator (the
term depending on p′ will be 0, which is the lowest obtainable value due to (3.51)),
and maximize the numerator, since in the latter the contribution of p′ is always
limited to k²ε. Now ε⁻¹(p̄ − (1 − ε)p₀) is exactly such a density. Hence the saddle
point property holds. ∎


Remark 3.16 (Robustness Classes) If we have more knowledge about the class of den- 
sities P, a different loss function will have the saddle point property. For instance, using 
a similar argument as above, one can show that the normal distribution is robust in the 
class of all distributions with bounded variance. This implies that among all possible dis- 
tributions with bounded variance, the estimator of the mean of a normal distribution has 
the highest variance. 

Likewise, the Laplacian distribution is robust in the class of all symmetric distributions 
with density p(0) > c for some fixed c > 0 (see [559, 251] for more details). 


Hence, even though a loss function defined according to Theorem 3.15 is generally 
desirable, we may be less cautious, and use a different loss function for improved 
performance, when we have additional knowledge of the distribution. 


Remark 3.17 (Mean and Median) Assume we are dealing with a mixture of a normal
distribution with variance σ² and an additional unknown distribution with weight at most
ε. It is easy to check that the application of Theorem 3.15 to normal distributions yields
Huber’s robust loss function from Table 3.1.

The maximizer of the likelihood (see also Problem 3.17) is a trimmed mean estimator
which discards ε of the data: effectively all θ_i deviating from the mean by more than σ are
ignored and the mean is computed from the remaining data. Hence Theorem 3.15 gives a
formal justification for this popular type of estimator.

If we let ε → 1 we recover the median estimator which stems from a Laplacian distribution.
Here, all patterns but the median one are discarded.
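
As a hedged illustration of how the constants in (3.48) can be obtained, the following sketch assumes p₀ is the standard normal density. Continuity of the first derivative of ln p̄ then forces k = θ₀, and the normalization condition reduces to the scalar equation 2φ(k)/k − 2Φ(−k) = ε/(1 − ε), with φ and Φ the standard normal density and distribution function. The function names (`huber_threshold`, `robust_density_mass`) and the bisection bounds are choices made for this example only.

```python
import math


def std_normal_pdf(x):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    """Distribution function of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def huber_threshold(eps, lo=1e-6, hi=10.0, iters=200):
    """Solve 2*pdf(k)/k - 2*cdf(-k) = eps/(1-eps) for k by bisection.

    Assumes p0 = N(0, 1), so that theta0 = k in (3.48).
    """
    target = eps / (1.0 - eps)

    def g(k):
        return 2.0 * std_normal_pdf(k) / k - 2.0 * std_normal_cdf(-k)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > target:  # g is strictly decreasing, so the root is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def robust_density_mass(eps, k):
    """Total mass of the density (3.48) for p0 = N(0, 1) and theta0 = k."""
    inner = std_normal_cdf(k) - std_normal_cdf(-k)  # Gaussian part on [-k, k]
    tails = 2.0 * std_normal_pdf(k) / k             # two exponential tails
    return (1.0 - eps) * (inner + tails)


k = huber_threshold(0.05)
print("threshold k for eps = 0.05:", round(k, 3))
print("total mass:", robust_density_mass(0.05, k))
```

Larger contamination ε gives a smaller threshold k, so that more of the sample falls into the linear (bounded-slope) part of the loss.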


Besides the classical examples of loss functions and density models, we might also
consider a slightly unconventional estimation procedure: use the average between
the k-smallest and the k-largest of all observations as the estimated
mean of the underlying distribution (for sorted observations θ_i with θ_i ≤ θ_j for
1 ≤ i ≤ j ≤ m the estimator computes (θ_k + θ_{m−k+1})/2). This procedure makes
sense, for instance, when we are trying to infer the mean of a random variable
generated by roundoff noise (i.e., noise whose density is constant within some
bounded interval) plus an additional unknown amount of noise.

Note that patterns strictly inside or strictly outside an interval [−ε, ε]
around the estimate have no direct influence on the outcome. Only patterns on the
boundary matter. This is a very similar situation to the behavior of Support Vector
Machines in regression, and one can show that it corresponds to the minimizer
of the ε-insensitive loss function (3.9). We will study the properties of the latter in
more detail in the following section and thereafter show how it can be transformed
into an adaptive risk functional.


3.4.2 Efficiency and the ε-Insensitive Loss Function


The tools of Section 3.3.2 allow us to analyze the ε-insensitive loss function in more
detail. Even though the asymptotic estimation of a location parameter setting is a
gross oversimplification of what is happening in a SV regression estimator (where
we estimate a nonparametric function, and moreover have only a limited number
of observations at our disposition), it will provide us with useful insights into this
more complex case [510, 481].

In a first step, we compute the efficiency of an estimator, for several noise models
and amounts of variance, using a density corresponding to the ε-insensitive loss
function (cf. Table 3.1);

$$p_\epsilon(y|\theta) = \frac{1}{2+2\epsilon} \exp(-|y-\theta|_\epsilon) = \frac{1}{2+2\epsilon} \begin{cases} 1 & \text{if } |y - \theta| \le \epsilon, \\ \exp(\epsilon - |y - \theta|) & \text{otherwise.} \end{cases} \tag{3.52}$$


For this purpose we have to evaluate the quantities G (3.41) and Q (3.42) of
Theorem 3.13. We obtain

$$G = m \int \left(\partial_\theta \ln p_\epsilon(y|\theta)\right)^2 dP(y|\theta) = m \left(1 - \int_{-\epsilon+\theta}^{\epsilon+\theta} p(y|\theta)\, dy\right), \tag{3.53}$$

$$Q = m \int \partial_\theta^2 \ln p_\epsilon(y|\theta)\, dP(y|\theta) = m \left(p(-\epsilon + \theta|\theta) + p(\epsilon + \theta|\theta)\right). \tag{3.54}$$

The Fisher information I of m iid random variables distributed according to p_θ is
m-times the value of a single random variable. Thus all dependencies on m in e
cancel out and we can limit ourselves to the case of m = 1 for the analysis of the
efficiency of estimators.

Now we may check what happens if we use the ε-insensitive loss function for
different types of noise model. For the sake of simplicity we begin with Gaussian
noise.


Example 3.18 (Gaussian Noise) Assume that y is normally distributed with zero mean
(i.e. θ = 0) and variance σ². By construction, the minimum obtainable variance is I⁻¹ = σ²
(recall that m = 1). Moreover (3.53) and (3.54) yield

$$\frac{G}{Q^2} = \frac{\pi}{2}\, \sigma^2 \exp\left(\frac{\epsilon^2}{\sigma^2}\right) \left(1 - \operatorname{erf}\left(\frac{\epsilon}{\sqrt{2}\,\sigma}\right)\right). \tag{3.55}$$

The efficiency e = Q²/(GI) is maximized for ε = 0.6120σ. This means that if the underlying
noise model is Gaussian with variance σ² and we have to use an ε-insensitive loss function
to estimate a location parameter, the most efficient estimator from this family is given by
ε = 0.6120σ.
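
The value quoted in Example 3.18 can be checked numerically. The sketch below evaluates e = Q²/(GI) for Gaussian noise from (3.53) and (3.54) with m = 1 and locates its maximum by a plain grid search; the function names, the search range [0, 3σ], and the grid resolution are assumptions of this illustration, not part of the original analysis.

```python
import math


def efficiency_gaussian(eps, sigma=1.0):
    """Asymptotic efficiency e = Q^2 / (G I) for N(0, sigma^2) noise, m = 1."""
    # G from (3.53): probability mass outside the tube [-eps, eps]
    G = 1.0 - math.erf(eps / (math.sqrt(2.0) * sigma))
    # Q from (3.54): density at the two tube boundaries
    Q = 2.0 * math.exp(-eps ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    fisher = 1.0 / sigma ** 2
    return Q * Q / (G * fisher)


def argmax_on_grid(f, lo, hi, steps=20000):
    """Return the grid point in [lo, hi] at which f is largest."""
    best_x, best_v = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        v = f(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x


eps_unit = argmax_on_grid(lambda e: efficiency_gaussian(e, 1.0), 0.0, 3.0)
eps_double = argmax_on_grid(lambda e: efficiency_gaussian(e, 2.0), 0.0, 6.0)
print("optimal eps for sigma = 1:", round(eps_unit, 3))
print("optimal eps for sigma = 2:", round(eps_double, 3))
```

Doubling σ doubles the maximizer, which is the linear scaling between tube width and noise level for this noise model.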


The consequence of (3.55) is that the optimal value of ε scales linearly with σ.
Of course, we could just use squared loss in such a situation, but in general, we
will not know the exact noise model, and squared loss does not lead to robust
estimators. The following lemma (which will come in handy in the next section)
shows that this is a general property of the ε-insensitive loss.


Lemma 3.19 (Linear Dependency between ε-Tube Width and Variance) Denote
by p a symmetric density with standard deviation σ > 0 (i.e. variance σ²). Then the optimal
value of ε (i.e. the value that achieves maximum asymptotic efficiency) for an estimator
using the ε-insensitive loss is given by

$$\epsilon_{\mathrm{opt}} = \sigma \operatorname*{argmin}_{\tau} \frac{1}{\left(p_{\mathrm{std}}(-\tau) + p_{\mathrm{std}}(\tau)\right)^2} \left(1 - \int_{-\tau}^{\tau} p_{\mathrm{std}}(\tau')\, d\tau'\right), \tag{3.56}$$

where p_std(τ) := σp(στ + θ|θ) is the standardized version of p(y|θ), i.e. it is obtained by
rescaling p(y|θ) to zero mean and unit variance.


Since p_std is independent of σ, we have a linear dependency between ε_opt and σ.
The scaling factor depends on the noise model.


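Proof We prove (3.56) by rewriting the efficiency e(ε) in terms of p_std via p(y|θ) =
σ⁻¹p_std(σ⁻¹(y − θ)). This yields

$$e(\epsilon) = \frac{Q^2}{IG} = \frac{\left(\sigma^{-1} p_{\mathrm{std}}(-\sigma^{-1}\epsilon) + \sigma^{-1} p_{\mathrm{std}}(\sigma^{-1}\epsilon)\right)^2}{IG} = \frac{\sigma^{-2}\left(p_{\mathrm{std}}(-\sigma^{-1}\epsilon) + p_{\mathrm{std}}(\sigma^{-1}\epsilon)\right)^2}{I \left(1 - \int_{-\sigma^{-1}\epsilon}^{\sigma^{-1}\epsilon} p_{\mathrm{std}}(\tau)\, d\tau\right)}.$$

The maximum of e(ε) does not depend directly on ε, but on σ⁻¹ε (which is
independent of σ). Hence we can find argmax_ε e(ε) by solving (3.56). ∎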


Lemma 3.19 made it apparent that in order to adjust ε we have to know σ beforehand.
Unfortunately, the latter is usually unknown at the beginning of the
estimation procedure.⁸ The solution to this dilemma is to make ε adaptive.


3.4.3 Adaptive Loss Functions


We again consider the trimmed mean estimator, which discards a predefined
fraction of largest and smallest samples. This method belongs to the more general
class of quantile estimators, which base their estimates on the value of samples
in a certain quantile. The latter methods do not require prior knowledge of the
variance, and adapt to whatever scale is required. What we need is a technique
which connects σ (in Huber’s robust loss function) or ε (in the ε-insensitive loss
case) with the deviations between the estimate θ̂ and the random variables y_i.

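Let us analyze what happens to the negative log likelihood, if, in the ε-insensitive
case, we change ε to ε + δ (with δ ∈ ℝ) while keeping θ̂ fixed. In particular
we assume that |δ| is chosen sufficiently small such that for all i = 1,...,m,

$$|\hat\theta - y_i| \begin{cases} < \epsilon + \delta & \text{if } |\hat\theta - y_i| < \epsilon \\ > \epsilon + \delta & \text{if } |\hat\theta - y_i| > \epsilon \end{cases} \tag{3.57}$$

Moreover denote by m_<, m_=, m_> the number of samples for which |θ̂ − y_i| is less
than, equal to, or greater than ε, respectively. Then

$$\sum_{i=1}^{m} |\hat\theta - y_i|_{\epsilon+\delta} = \sum_{|\hat\theta - y_i| < \epsilon} |\hat\theta - y_i|_{\epsilon} + \sum_{|\hat\theta - y_i| > \epsilon} |\hat\theta - y_i|_{\epsilon} - m_> \delta + \sum_{|\hat\theta - y_i| = \epsilon} |\hat\theta - y_i|_{\epsilon+\delta}$$

$$= \sum_{i=1}^{m} |\hat\theta - y_i|_{\epsilon} - \begin{cases} m_> \delta & \text{if } \delta > 0 \\ (m_> + m_=)\delta & \text{otherwise.} \end{cases} \tag{3.58}$$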

In other words, the amount by which the loss changes depends only on the
quantiles at ε. What happens if we make ε itself a variable of the optimization
problem? By the scaling properties of (3.58) one can see that for ν ∈ [0,1]

$$\underset{\hat\theta,\, \epsilon}{\text{minimize}} \quad \frac{1}{m} \sum_{i=1}^{m} |\hat\theta - y_i|_{\epsilon} + \nu\epsilon \tag{3.59}$$

is minimized if ε is chosen such that

$$\frac{m_>}{m} \le \nu \le \frac{m_> + m_=}{m}. \tag{3.60}$$

8. The obvious question is why one would ever like to choose an ε-insensitive loss in the
presence of Gaussian noise in the first place. If the complexity of the function expansion is
of no concern and the highest accuracy is required, squared loss is to be preferred. In most
cases, however, it is not quite clear what exactly the type of the additive noise model is. This
is when we would like to have a more conservative estimator. In practice, the ε-insensitive
loss has been shown to work rather well on a variety of tasks (Chapter 9).


This relation holds since at the solution (θ̂, ε) the solution also has to be optimal
wrt. ε alone while keeping θ̂ fixed. In the latter case, however, the derivatives of
the log-likelihood (i.e. error) term wrt. ε at the solution are given by m_>/m and
(m_> + m_=)/m on the left and right hand side respectively.⁹ These have to cancel
with ν, which proves the claim. Furthermore, computing the derivative of (3.59) with
respect to θ̂ shows that the number of samples outside the interval [θ̂ − ε, θ̂ + ε] has
to be equal on both halves (−∞, θ̂ − ε) and (θ̂ + ε, ∞). We have the following theorem:


Theorem 3.20 (Quantile Estimation as Optimization Problem [481]) A quantile
procedure to estimate the mean of a distribution by taking the average of the samples
at the (ν/2)th and (1 − ν/2)th quantile is equivalent to minimizing (3.59). In particular,


1. ν is an upper bound on the fraction of samples outside the interval [θ̂ − ε, θ̂ + ε].
2. ν is a lower bound on the fraction of samples outside the interval ]θ̂ − ε, θ̂ + ε[.
3. If the distribution p(θ) is continuous, for all ν ∈ [0,1]

$$\lim_{m \to \infty} P\left\{\frac{m_=}{m} < \epsilon\right\} = 1 \quad\text{for all } \epsilon > 0. \tag{3.61}$$


One might question the practical advantage of this method over direct trimming
of the sample Y. In fact, the use of (3.59) is not recommended if all we want is to
estimate θ. That said, (3.59) does allow us to employ trimmed estimation in the
nonparametric case, cf. Chapter 9.
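
The correspondence between (3.59) and a quantile procedure can be made concrete with a small sketch. It assumes distinct sample values, reads the estimate and the tube width off two order statistics (discarding k = ⌊νm/2⌋ samples on each side), and then counts the samples outside and on the boundary of the tube so that (3.60) can be checked. The function names and the tolerance used for boundary comparisons are choices of this example.

```python
import math


def nu_objective(theta, eps, ys, nu):
    """Objective (3.59): mean eps-insensitive loss plus nu * eps."""
    m = len(ys)
    loss = sum(max(0.0, abs(theta - y) - eps) for y in ys)
    return loss / m + nu * eps


def quantile_solution(ys, nu):
    """Read (theta, eps) off the order statistics; assumes distinct samples."""
    s = sorted(ys)
    m = len(s)
    # number of samples discarded on each side, capped so the tube is non-empty
    k = min(int(math.floor(nu * m / 2.0)), (m - 1) // 2)
    lo, hi = s[k], s[m - 1 - k]
    return 0.5 * (lo + hi), 0.5 * (hi - lo)


def tube_counts(theta, eps, ys, tol=1e-12):
    """Return (m_>, m_=): samples strictly outside and exactly on the tube boundary."""
    m_gt = sum(1 for y in ys if abs(theta - y) > eps + tol)
    m_eq = sum(1 for y in ys if abs(abs(theta - y) - eps) <= tol)
    return m_gt, m_eq


ys = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0, 89.0]
nu = 0.4
theta, eps = quantile_solution(ys, nu)
m_gt, m_eq = tube_counts(theta, eps, ys)
print("theta =", theta, "eps =", eps)
print("fraction outside:", m_gt / len(ys), "fraction outside or on boundary:", (m_gt + m_eq) / len(ys))
print("objective (3.59):", nu_objective(theta, eps, ys, nu))
```

When νm is an even integer, as in this toy sample, the minimizer of (3.59) is not unique; the order-statistic choice above is one of the minimizers.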

Extension to General Robust Estimators

Unfortunately, we were unable to find a similar method for Huber’s robust loss
function, since in this case the change in the negative log-likelihood incurred by
changing σ not only involves the (statistical) rank of y_i, but also the exact location
of samples with |y_i − θ| < σ.

One way to overcome this problem is to re-estimate σ adaptively while minimizing
a term similar to (3.59) (see [180] for details in the context of boosting, Section 10.6.3
for a discussion of online estimation techniques, or [251] for a general overview).


3.4.4 Optimal Choice of ν


Let us return to the ε-insensitive loss. A combination of Theorems 3.20, 3.13 and
Lemma 3.19 allows us to compute optimal values of ν for various distributions,
provided that an ε-insensitive loss function is to be used in the estimation
procedure.¹⁰

The idea is to determine the optimal value of ε for a fixed density p(y|θ) via
(3.56), and compute the corresponding fraction ν of patterns outside the interval
[−ε + θ, ε + θ].


9. Strictly speaking, the derivative is not defined at ε; the lhs and rhs values are defined,
however, which is sufficient for our purpose.

10. This is not optimal in the sense of Theorem 3.15, which suggests the use of a more
adapted loss function. However (as already stated in the introduction of this chapter),
algorithmic or technical reasons such as computationally efficient solutions or limited
memory may provide sufficient motivation to use such a loss function.


Table 3.2 Optimal ν and ε for various degrees of polynomial additive noise.

| Polynomial Degree           | 1 | 2      | 3      | 4      | 5     |
| Optimal ε for unit variance | 0 | 0.6120 | 1.1180 | 1.3583 | 1.484 |

| Polynomial Degree           | 6      | 7 | 8 | 9 | 10 |
| Optimal ν                   | 0.1080 |   |   |   |    |


Theorem 3.21 (Optimal Choice of ν) Denote by p a symmetric density with standard
deviation σ > 0 and by p_std the corresponding rescaled density with zero mean and unit variance.
Then the optimal value of ν (i.e. the value that achieves maximum asymptotic efficiency)
for an estimator using the ε-insensitive loss is given by

$$\nu = 1 - \int_{-\epsilon}^{\epsilon} p_{\mathrm{std}}(y)\, dy \tag{3.62}$$

where ε is chosen according to (3.56). This expression is independent of σ.


Proof The independence of σ follows from the fact that ν depends only on p_std.
Next we show (3.62). For a given density p, the asymptotically optimal value of
ε is given by Lemma 3.19. The average fraction of patterns outside the interval
[θ̂ − ε_opt, θ̂ + ε_opt] is

$$\nu = 1 - \int_{-\epsilon_{\mathrm{opt}} + \theta}^{\epsilon_{\mathrm{opt}} + \theta} p(y|\theta)\, dy = 1 - \int_{-\sigma^{-1}\epsilon_{\mathrm{opt}}}^{\sigma^{-1}\epsilon_{\mathrm{opt}}} p_{\mathrm{std}}(y)\, dy, \tag{3.63}$$

which depends only on σ⁻¹ε_opt and is thus independent of σ. Combining (3.63)
with (3.56) yields the theorem. ∎


This means that given the type of additive noise, we can determine the value of
ν such that it yields the asymptotically most efficient estimator independent of the
level of the noise. These theoretical predictions have since been confirmed rather
accurately in a set of regression experiments [95].

Let us now look at some special cases. 


Example 3.22 (Optimal ν for Polynomial Noise) Arbitrary polynomial noise models
(∝ e^{−|ξ|^p}) with unit variance can be written as

$$p(y) = c_p \exp\left(-c'_p |y|^p\right) \quad\text{where}\quad c_p = \frac{1}{2} \sqrt{\frac{\Gamma(3/p)}{\Gamma(1/p)}}\, \frac{p}{\Gamma(1/p)} \quad\text{and}\quad c'_p = \left(\frac{\Gamma(3/p)}{\Gamma(1/p)}\right)^{p/2},$$

where Γ(x) is the gamma function. Figure 3.3 shows ν_opt for polynomial degrees in the
interval [1,10]. For convenience, the explicit numerical values are repeated in Table 3.2.
Observe that as the distribution becomes “lighter-tailed”, the optimal ν decreases; in
other words, we may then use a larger amount of the data for the purpose of estimation.
This is reasonable since it is only for very long tails of the distribution (data with many
outliers) that we have to be conservative and discard a large fraction of observations.


Figure 3.3 Optimal ν and ε for various degrees of polynomial additive noise.
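
The optimal ε and ν for a given polynomial degree can be approximated numerically from (3.56) and (3.62). The sketch below is one hedged way of doing so: it uses the unit-variance density of Example 3.22 (with the degree written as `p`), integrates it with the trapezoid rule, and minimizes the objective of (3.56) on a grid. The grid range, step size, and the cut-off on the tail mass are assumptions chosen for numerical stability, not part of the original derivation.

```python
import math


def unit_variance_poly_density(p):
    """Density c_p * exp(-c'_p |y|^p) with zero mean and unit variance."""
    g1, g3 = math.gamma(1.0 / p), math.gamma(3.0 / p)
    ratio = g3 / g1
    c = 0.5 * math.sqrt(ratio) * p / g1
    c_prime = ratio ** (p / 2.0)
    return lambda y: c * math.exp(-c_prime * abs(y) ** p)


def optimal_eps_and_nu(p, tau_max=3.0, steps=3000):
    """Grid minimization of (3.56); returns (eps_opt, nu_opt) for unit variance."""
    dens = unit_variance_poly_density(p)
    h = tau_max / steps
    mass = 0.0                 # integral of the density over [-tau, tau]
    prev = dens(0.0)
    best_value, best_tau, best_nu = float("inf"), 0.0, 1.0
    for i in range(steps + 1):
        tau = i * h
        cur = dens(tau)
        if i > 0:
            mass += h * (prev + cur)   # trapezoid rule, doubled by symmetry
        tail = 1.0 - mass              # fraction outside the tube, cf. (3.62)
        if tail < 1e-4:                # stop before round-off dominates the tail
            break
        value = tail / (2.0 * cur) ** 2
        if value < best_value:
            best_value, best_tau, best_nu = value, tau, tail
        prev = cur
    return best_tau, best_nu


for degree in (1, 2, 3):
    eps_opt, nu_opt = optimal_eps_and_nu(degree)
    print(degree, round(eps_opt, 3), round(nu_opt, 3))
```

For degree 1 (Laplacian noise) the minimum sits at ε = 0, so every sample lies outside the tube, while degrees 2 and 3 give values close to the entries 0.6120 and 1.1180 of Table 3.2.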


Even though we derived these relations solely for the case where a single number
(θ) has to be estimated, experiments show that the same scaling properties hold
for the nonparametric case. It is still an open research problem to establish this
connection exactly.

As we shall see, in the nonparametric case, the effect of ν will be that it both
determines the number of Support Vectors (i.e., the number of basis functions
needed to expand the solution) and also the fraction of function values f(x_i) with
deviation larger than ε from the corresponding observations. Further information
on this topic, both from the statistical and the algorithmic point of view, can be
found in Section 9.3.


3.5 Summary 


We saw in this chapter that there exist two complementary concepts as to how risk 
and loss functions should be designed. The first one is data driven and uses the 
incurred loss as its principal guideline, possibly modified in order to suit the need 
of numerical efficiency. This leads to loss functions and the definitions of empirical 
and expected risk. 

A second method is based on the idea of estimating (or at least approximating)
the distribution which may be responsible for generating the data. We showed
that in a Maximum Likelihood setting this concept is rather similar to the notions
of risk and loss, with c(x, y, f(x)) = −ln p(y|x, f(x)) as the link between both
quantities.

This point of view allowed us to analyze the properties of estimators in more
detail and provide lower bounds on the performance of unbiased estimators, i.e.
the Cramér-Rao theorem. The latter was then used as a benchmarking tool for
various loss functions and density models, such as the ε-insensitive loss. The
consequence of this analysis is a corroboration of experimental findings that there
exists a linear correlation between the amount of noise in the observations and the
optimal width of ε.

This, in turn, allowed us to construct adaptive loss functions which adjust
themselves to the amount of noise, much like trimmed mean estimators. These
formulations can be used directly in mathematical programs, leading to ν-SV
algorithms in subsequent chapters. The question of which choices are optimal in a
finite sample size setting remains an open research problem.


3.6 Problems 


3.1 (Soft Margin and Logistic Regression •) The soft margin loss function c_soft and
the logistic loss c_logist are asymptotically almost the same; show that

$$\lim_{f \to -\infty} \left(c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f)\right) = 1 \tag{3.64}$$

$$\lim_{f \to \infty} \left(c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f)\right) = 0. \tag{3.65}$$


3.2 (Multi-class Discrimination ••) Assume you have to solve a classification problem
with M different classes. Discuss how the number of functions used to solve this task
affects the quality of the solution.


▪ How would the loss function look if you were to use only one real-valued function
f : X → ℝ? Which symmetries are violated in this case (hint: what happens if you permute
the classes)?


▪ How many functions do you need if each of them makes a binary decision f : X → {0,1}?


▪ How many functions do you need in order to make the solution permutation symmetric
with respect to the class labels?


▪ How should you assess the classification error? Is it a good idea to use the misclassification
rate of one individual function as a performance criterion (hint: correlation of errors)?
By how much can this error differ from the total misclassification error?


3.3 (Mean and Median •) Assume 8 people want to gather for a meeting; 5 of them live
in Stuttgart and 3 in Munich. Where should they meet if (a) they want the total distance
traveled by all people to be minimal, (b) they want the average distance traveled per person
to be minimal, or (c) they want the average squared distance to be minimal? What happens
to the meeting points if one of the 3 people moves from Munich to Sydney?


3.4 (Locally Adaptive Loss Functions •••) Assume that the loss function c(x, y, f(x))
varies with x. What does this mean for the expected loss? Can you give a bound on the
latter even if you know p(y|x) and f at every point but know c only on a finite sample
(hint: construct a counterexample)? How will things change if c cannot vary much with
x?


3.5 (Transduction Error •••) Assume that we want to minimize the test error of
misclassification R_test[f], given a training sample {(x₁, y₁), ..., (x_m, y_m)}, a test sample
{x′₁, ..., x′_ℓ} and a loss function c(x, y, f(x)).

Show that any loss function c′(x′, f(x′)) on the test sample has to be symmetric in f,
i.e. c′(x′, f(x′)) = c′(x′, −f(x′)). Prove that no non-constant convex function can satisfy
this property. What does this mean for the practical solution of the optimization problem?
See [267, 37, 211, 103] for details.


3.6 (Convexity and Uniqueness ••) Show that the problem of estimating a location
parameter (a single scalar) has an interval [a,b] ⊂ ℝ of equivalent global minima if the
loss functions are convex. For non-convex loss functions construct an example where this
is not the case.


3.7 (Linearly Dependent Parameters ••) Show that in a linear model f = Σ_i α_i f_i on
X it is impossible to find a unique set of optimal parameters α_i if the functions f_i are not
linearly independent. Does this have any effect on f itself?


3.8 (Ill-posed Problems •••) Assume you want to solve the problem Ax = y where A
is a symmetric positive definite matrix, i.e., a matrix with nonnegative eigenvalues. If you
change y to y′, how much will the solution x′ of Ax′ = y′ differ from x? Give lower and
upper bounds on this quantity. Hint: decompose y into the eigensystem of A.

3.9 (Fisher Map [258] ••) Show that the map

$$U_\theta(x) := I^{-\frac{1}{2}}\, \partial_\theta \ln p(x|\theta) \tag{3.66}$$

maps x into vectors with zero mean and unit variance. Chapter 13 will use this map to
design kernels.


3.10 (Cramér-Rao Inequality for Multivariate Estimators ••) Prove equation (3.31).
Hint: start by applying the Cauchy-Schwarz inequality to


(det Eel(8(6) — £,0(8))(Te(0) — ETO") (3.67) 


to obtain I and B and compute the expected value coefficient-wise. 


3.11 (Soft Margin Loss and Conditional Probabilities [521] •••) What is the
conditional probability p(y|x) corresponding to the soft margin loss function c(x, y, f(x)) =
max(0, 1 − y f(x))?


▪ How can you fix the problem that the probabilities p(−1|x) and p(1|x) have to sum up
to 1?


▪ How does the introduction of a third class (“don’t know”) change the problem? What is
the problem with this approach? Hint: What is the behavior for large |f(x)|?


3.12 (Label Noise ••) Denote by P(y = 1|f(x)) and P(y = −1|f(x)) the conditional
probabilities of labels ±1 for a classifier output f(x). How will P change if we randomly
flip labels with probability η ∈ (0,1)? How should you adapt your density model?


3.13 (Unbiased Estimators ••) Prove that the least mean square estimator is unbiased
for arbitrary symmetric distributions. Can you extend the result to arbitrary symmetric
losses?


3.14 (Efficiency of Huber’s Robust Estimator ••) Compute the efficiency of Huber’s
Robust Estimator in the presence of pure Gaussian noise with unit variance.


3.15 (Influence and Robustness •••) Prove that for robust estimators using (3.48) as
their density model, the maximum change in the minimizer of the empirical risk is bounded 
by © if a sample 6; is changed to 6; + 6. What happens in the case of Gaussian density 
models (i.e., squared loss)? 


3.16 (Robustness of Gaussian Distributions [559] •••) Prove that the normal
distribution with variance σ² is robust among the class of distributions with bounded variance
(by σ²). Hint: show that we have a saddle point analogous to Theorem 3.15 by exploiting
Theorems 3.13 and 3.14.


3.17 (Trimmed Mean ••) Show that under the assumption of an unknown distribution
contributing at most ε, Huber’s robust loss function for normal distributions leads to a
trimmed mean estimator which discards ε of the data.


3.18 (Optimal ν for Gaussian Noise •) Give an explicit solution for the optimal ν in
the case of additive Gaussian noise.


3.19 (Optimal ν for Discrete Distribution ••) Assume that we have a noise model
with a discrete distribution of θ, where P(θ = ε) = P(θ = −ε) = p₁, P(θ = 2ε) = P(θ =
−2ε) = p₂, 2(p₁ + p₂) = 1, and p₁, p₂ > 0. Compute the optimal value of ν.


4 Regularization


Overview


Prerequisites

Minimizing the empirical risk can lead to numerical instabilities and bad general- 
ization performance. A possible way to avoid this problem is to restrict the class of 
admissible solutions, for instance to a compact set. This technique was introduced 
by Tikhonov and Arsenin [538] for solving inverse problems and has since been 
applied to learning problems with great success. In statistics, the corresponding 
estimators are often referred to as shrinkage estimators [262]. 

Kernel methods are best suited for two special types of regularization: a coef- 
ficient space constraint on the expansion coefficients of the weight vector in feature 
space [343, 591, 37, 517, 189], or, alternatively, a function space regularization di- 
rectly penalizing the weight vector in feature space [573, 62, 561]. In this chapter we 
will discuss the connections between regularization, Reproducing Kernel Hilbert 
Spaces (RKHS), feature spaces, and regularization operators. The connection to 
Gaussian Processes will be explained in more detail in Section 16.3. These differ- 
ent viewpoints will help us to gain insight into the success of kernel methods. 

We start by introducing regularized risk functionals (Section 4.1), followed by a 
discussion of the Representer Theorem describing the functional form of the mini- 
mizers of a certain class of such risk functionals (Section 4.2). Section 4.3 introduces 
regularization operators and details their connection to SV kernels. Sections 4.4 
through 4.6 look at this connection for specific classes of kernels. Following that, 
we have several sections dealing with various regularization issues of interest for 
machine learning: vector-valued functions (Section 4.7), semiparametric regular- 
ization (Section 4.8), and finally, coefficient-based regularization (Section 4.9). 

This chapter may not be easy to digest for some of our readers. We recommend
that most readers should nevertheless consider going through Sections 4.1
and 4.2. Those two sections are accessible with the background given in Chapters
1 and 2. The following Section 4.3 is somewhat more technical, since it
uses the concept of Green’s functions and operators, but should nevertheless still
be looked at. A background in functional analysis will be helpful.

Sections 4.4, 4.5, and 4.6 are more difficult, and require a solid knowledge of
Fourier integrals and elements of the theory of special functions. To understand
Section 4.7, some basic notions of group theory are beneficial. Finally, Sections 4.8
and 4.9 do not require additional knowledge beyond the basic concepts
put forward in the introductory chapters. Yet, some readers may find it beneficial
to read these two last sections after they have gained a deeper insight into classification,
regression and mathematical programming, as provided by Chapters 6, 7, and 9.


4.1 The Regularized Risk Functional


Continuity 
Assumption 


The key idea in regularization is to restrict the class of possible minimizers F (with 
f € F) of the empirical risk functional Remp[f] such that F becomes a compact set. 
While there exist various characterizations for compact sets and we may define 
a large variety of such sets which will suit different assumptions on the type 
of estimates we get, the common key idea is compactness. In addition, we will 
assume that Remp[f] is continuous in f. 

Note that this is a stronger assumption than it may appear at first glance. It is
easily satisfied for many regression problems, such as those using squared loss or
the ε-insensitive loss. Yet binary valued loss functions, as are often used in
classification (such as c(x, y, f(x)) = ½(1 − sgn(y f(x)))), do not meet the requirements. Since
both the exact minimization of Remp[f] for classification problems [367], even with
very restricted classes of functions, and also the approximate solution to this
problem [20] have been proven to be NP-hard, we will not bother with this case any
further, but rather attempt to minimize a continuous approximation of the 0−1
loss, such as the one using a soft margin loss function (3.3).

We may now apply the operator inversion lemma to show that for compact F,
the inverse map from the minimum of the empirical risk functional Remp[f] : F →
ℝ to its minimizer f is continuous and the optimization problem well-posed.


Theorem 4.1 (Operator Inversion Lemma (e.g., [431])) Let X be a compact set and
let the map f : X → Y be continuous. Then there exists an inverse map f⁻¹ : f(X) → X
that is also continuous.


We do not directly specify a compact set F, since this leads to a constrained
optimization problem, which can be cumbersome in practice. Instead, we add
a stabilization (regularization) term Ω[f] to the original objective function; the
latter could be Remp[f], for instance. This, too, leads to better conditioning of the
problem. We consider the following class of regularized risk functionals (see also
Problem 4.1)

$$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \Omega[f]. \tag{4.1}$$

Here λ > 0 is the so-called regularization parameter which specifies the trade-off
between minimization of Remp[f] and the smoothness or simplicity which is
enforced by small Ω[f]. Usually one chooses Ω[f] to be convex, since this ensures
that there exists only one global minimum, provided Remp[f] is also convex (see
Lemma 6.3 and Theorem 6.5).
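
To make the trade-off in (4.1) tangible, here is a deliberately small sketch. It assumes a one-dimensional linear model f(x) = w·x, squared loss for the empirical risk, and the regularizer Ω[f] = ½w²; these choices, the toy data, and the function names are assumptions of this illustration. Under them the minimizer has a closed form, and increasing λ shrinks w towards zero.

```python
def regularized_risk(w, xs, ys, lam):
    """R_reg = R_emp + lam * Omega with squared loss and Omega = 0.5 * w^2."""
    m = len(xs)
    r_emp = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / m
    return r_emp + 0.5 * lam * w * w


def regularized_minimizer(xs, ys, lam):
    """Closed-form minimizer of regularized_risk for the model f(x) = w * x."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / m
    sxx = sum(x * x for x in xs) / m
    # stationarity: -2 * (sxy - w * sxx) + lam * w = 0
    return 2.0 * sxy / (2.0 * sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
for lam in (0.0, 1.0, 10.0):
    w = regularized_minimizer(xs, ys, lam)
    print("lambda =", lam, "w =", round(w, 4), "R_reg =", round(regularized_risk(w, xs, ys, lam), 4))
```

Both terms are convex in w here, so the stationary point is the unique global minimum, in line with the remark on convexity above.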

Maximization of the margin of classification in feature space by using the
regularizing term ½‖w‖², and thus minimizing

$$\Omega[f] := \frac{1}{2}\|w\|^2, \quad\text{and therefore}\quad R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|w\|^2, \tag{4.2}$$

is the common choice in SV classification [573, 62]. In regression, the geometrical
interpretation of minimizing ½‖w‖² is to find the flattest function with sufficient
approximation qualities. Unless stated otherwise, we will limit ourselves to this
type of regularizer in the present chapter. Other methods, e.g., minimizing the ℓ_p
norm (where ‖x‖_p^p = Σ_i |x_i|^p) of the expansion coefficients for w, will be discussed in
Section 4.9.

As described in Section 2.2.3, we can equivalently think of the feature space as
a reproducing kernel Hilbert space. It is often useful, and indeed it will be one of
the central themes of this chapter, to rewrite the risk functional (4.2) in terms of the
RKHS representation of the feature space. In this case, we equivalently minimize

$$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 \tag{4.3}$$

over the whole space ℋ. The next section will study the properties of minimizers
of (4.3), and similar regularizers that depend on ‖f‖_ℋ.


4.2 The Representer Theorem 


History of the 
Representer 
Theorem 


The explicit form of a minimizer of Rreg[f] is given by the celebrated representer 
theorem of Kimeldorf and Wahba [296] which plays a central role in solving prac- 
tical problems of statistical estimation. It was first proven in the context of squared 
loss functions, and later extended to general pointwise loss functions [115]. For a 
machine learning point of view of the representer theorem, and variational proofs, 
see [205, 512]. The linear case has also been dealt with in [300]. We present a new 
and slightly more general version of the theorem with a simple proof [473]. As 
above, ℋ is the RKHS associated to the kernel k.


Requirements on Ω[f]


Significance


Sparsity and Loss Function

Theorem 4.2 (Representer Theorem) Denote by Ω : [0,∞) → ℝ a strictly monotonic
increasing function, by X a set, and by c : (X × ℝ²)^m → ℝ ∪ {∞} an arbitrary loss
function. Then each minimizer f ∈ ℋ of the regularized risk

$$c\left((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\right) + \Omega\left(\|f\|_{\mathcal{H}}\right) \tag{4.4}$$

admits a representation of the form

$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x). \tag{4.5}$$


Note that this setting is slightly more general than Definition 3.1 since it allows
coupling between the samples $(x_i, y_i)$.

Before we proceed with the actual proof, let us make a few remarks. The original
form, with pointwise mean squared loss

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) = \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2 \qquad (4.6)$$

or hard constraints (i.e., hard limits on the maximally allowed error, incorporated
formally by using a cost function that takes the value $\infty$), and $\Omega(\|f\|) = \lambda \|f\|_{\mathcal{H}}^2$
($\lambda > 0$), is due to Kimeldorf and Wahba [296].

Monotonicity of $\Omega$ is necessary to ensure that the theorem holds. It does not
prevent the regularized risk functional (4.4) from having multiple local minima. 
To ensure a single minimum, we would need to require convexity. If we discard 
the strictness of the monotonicity, then it no longer follows that each minimizer of 
the regularized risk admits an expansion (4.5); it still follows, however, that there 
is always another solution that is as good, and that does admit the expansion. 

Note that the freedom to use regularizers other than $\Omega(\|f\|) = \frac{1}{2}\|f\|^2$ allows us
in principle to design algorithms that are more closely aligned with recommenda- 
tions given by bounds derived from statistical learning theory, as described below 
(cf. Problem 5.7). 

The significance of the Representer Theorem is that although we might be trying
to solve an optimization problem in an infinite-dimensional space $\mathcal{H}$, containing
linear combinations of kernels centered on arbitrary points of $\mathcal{X}$, it states that
the solution lies in the span of m particular kernels, namely those centered on the
training points. In the Support Vector community, (4.5) is called the Support Vector
expansion. For suitable choices of loss functions, it has empirically been found that
many of the $\alpha_i$ often equal 0 (see Problem 4.6 for more detail on the connection
between sparsity and loss functions).
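
As a minimal sketch of how the expansion (4.5) is used in practice, the Python snippet below fits the squared loss (4.6) together with the regularizer $\frac{\lambda}{2}\|f\|^2$ of (4.3). Substituting (4.5) into the objective and setting the gradient with respect to the coefficients to zero leads to the linear system $(K + \frac{\lambda m}{2} 1)\alpha = y$ with $K_{ij} = k(x_i, x_j)$. The Gaussian kernel, the toy data, and the helper names (`gaussian_kernel`, `solve_linear`, `fit_expansion`, `predict`) are assumptions made for illustration; they are not taken from the text.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-(x - z)^2 / (2 sigma^2)) for scalar inputs (illustrative choice)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def solve_linear(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (M[i][n] - s) / M[i][i]
    return a

def fit_expansion(xs, ys, lam, kernel=gaussian_kernel):
    """Coefficients alpha_i of f(x) = sum_i alpha_i k(x_i, x) for
    (1/m) sum_i (y_i - f(x_i))^2 + (lam/2) ||f||^2."""
    m = len(xs)
    K = [[kernel(xi, xj) for xj in xs] for xi in xs]
    A = [[K[i][j] + (0.5 * lam * m if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    return solve_linear(A, ys)

def predict(x, xs, alpha, kernel=gaussian_kernel):
    """Evaluate the kernel expansion at a new point x."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))

# Toy data (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]
alpha = fit_expansion(xs, ys, lam=1e-2)
```

Only the m coefficients are stored, and evaluating the estimate at a new point needs only kernel evaluations against the training points, which is the practical content of the theorem.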


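Proof For convenience we will assume that we are dealing with $\bar{\Omega}(\|f\|^2) :=
\Omega(\|f\|)$ rather than $\Omega(\|f\|)$. This is no restriction at all, since the quadratic function
is strictly monotonic on $[0, \infty)$, and therefore $\bar{\Omega}$ is strictly monotonic on $[0, \infty)$ if
and only if $\Omega$ also satisfies this requirement.

We may decompose any $f \in \mathcal{H}$ into a part contained in the span of the kernel
functions $k(x_1, \cdot), \ldots, k(x_m, \cdot)$, and one in the orthogonal complement;

$$f(x) = f_{\parallel}(x) + f_{\perp}(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + f_{\perp}(x). \qquad (4.7)$$

Here $\alpha_i \in \mathbb{R}$ and $f_{\perp} \in \mathcal{H}$ with $\langle f_{\perp}, k(x_i, \cdot) \rangle_{\mathcal{H}} = 0$ for all $i \in [m] := \{1, \ldots, m\}$. By
(2.34) we may write $f(x_j)$ (for all $j \in [m]$) as

$$f(x_j) = \langle f(\cdot), k(x_j, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x_j) + \langle f_{\perp}(\cdot), k(x_j, \cdot) \rangle_{\mathcal{H}} = \sum_{i=1}^{m} \alpha_i k(x_i, x_j). \qquad (4.8)$$

Second, for all $f_{\perp}$,

$$\Omega(\|f\|_{\mathcal{H}}) = \bar{\Omega}\left( \Big\| \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \Big\|_{\mathcal{H}}^2 + \|f_{\perp}\|_{\mathcal{H}}^2 \right) \geq \bar{\Omega}\left( \Big\| \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \Big\|_{\mathcal{H}}^2 \right). \qquad (4.9)$$

Thus for any fixed $\alpha_i \in \mathbb{R}$ the risk functional (4.4) is minimized for $f_{\perp} = 0$. Since
this also has to hold for the solution, the theorem holds. ∎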


Let us state two immediate extensions of Theorem 4.2. The proof of the following 
theorem is left as an exercise (see Problem 4.3). 


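Theorem 4.3 (Semiparametric Representer Theorem) Suppose that in addition to
the assumptions of the previous theorem we are given a set of M real-valued functions
$\{\psi_p\}_{p=1}^{M} : \mathcal{X} \to \mathbb{R}$ with the property that the $m \times M$ matrix $(\psi_p(x_i))_{ip}$ has rank M. Then
any $\tilde{f} := f + h$, with $f \in \mathcal{H}$ and $h \in \operatorname{span}\{\psi_p\}$, minimizing the regularized risk

$$c\big((x_1, y_1, \tilde{f}(x_1)), \ldots, (x_m, y_m, \tilde{f}(x_m))\big) + \Omega\left(\|f\|_{\mathcal{H}}\right) \qquad (4.10)$$

admits a representation of the form

$$\tilde{f}(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + \sum_{p=1}^{M} \beta_p \psi_p(x), \qquad (4.11)$$

with $\beta_p \in \mathbb{R}$ for all $p \in [M]$.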


We will discuss applications of the semiparametric extension in Section 4.8. 


Remark 4.4 (Biased Regularization) Another extension of the representer theorems
can be obtained by including a term $-\langle f_0, f \rangle$ in (4.4) or (4.10), where $f_0 \in \mathcal{H}$. In this
case, if a solution to the minimization problem exists, it admits an expansion which differs
from those described above in that it additionally contains a multiple of $f_0$. To see this,
decompose $f_{\perp}(\cdot)$ used in the proof of Theorem 4.2 into a part orthogonal to $f_0$ and the
remainder.


Biased regularization means that we do not assume that the function f = 0 is 
the most simple of all estimates. This is a convenient way of incorporating prior 
knowledge about the type of solution we expect from our estimation procedure. 
After this rather abstract and formal treatment of regularization, let us consider 
some practical cases where the representer theorem can be applied. First consider
the problem of regression, where the solution is chosen to be an element of a
Reproducing Kernel Hilbert Space.


Example 4.5 (Support Vector Regression) For Support Vector regression with the $\varepsilon$-insensitive
loss (Section 1.6) we have

$$c\big((x_i, y_i, f(x_i))_{i \in [m]}\big) = \frac{1}{\lambda} \sum_{i=1}^{m} |y_i - f(x_i)|_{\varepsilon} \qquad (4.12)$$

and $\Omega(\|f\|) = \frac{1}{2}\|f\|^2$, where $\lambda > 0$ and $\varepsilon \geq 0$ are fixed parameters which determine the
trade-off between regularization and fit to the training set. In addition, a single (M = 1)
constant function $\psi_1(x) = 1$ is used as an offset, and is not regularized by the algorithm.

Section 4.8 and [507] contain details on how the case of M > 1, for which more than one
parametric function is used, can be dealt with algorithmically. Theorem 4.3 also applies in
this case.


Example 4.6 (Support Vector Classification) Here, the targets consist of $y_i \in \{\pm 1\}$,
and we use the soft margin loss function (3.3) to obtain

$$c\big((x_i, y_i, f(x_i))_i\big) = \frac{1}{\lambda} \sum_{i=1}^{m} \max(0, 1 - y_i f(x_i)). \qquad (4.13)$$

The regularizer is $\Omega(\|f\|) = \frac{1}{2}\|f\|^2$, and $\psi_1(x) = 1$. For $\lambda \to 0$, we recover the hard
margin SVM, for which the minimizer must correctly classify each training point $(x_i, y_i)$.
Note that after training, the actual classifier will be $\operatorname{sgn}(f(\cdot))$.
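
The two pointwise losses used in Examples 4.5 and 4.6 can be sketched in a few lines of Python. This is only an illustration: the function names `eps_insensitive` and `soft_margin` are made up here, and the snippet evaluates the losses for single points rather than the full risk functional.

```python
def eps_insensitive(y, fx, eps):
    """epsilon-insensitive loss |y - f(x)|_eps = max(0, |y - f(x)| - eps)."""
    return max(0.0, abs(y - fx) - eps)

def soft_margin(y, fx):
    """Soft margin loss max(0, 1 - y f(x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)

def classify(fx):
    """Decision rule sgn(f(x)); a zero output is mapped to +1 by assumption."""
    return 1 if fx >= 0 else -1
```

Both losses vanish on a whole region (inside the $\varepsilon$-tube, or beyond the margin), which is one way to see why many coefficients of the expansion can end up being zero.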


Example 4.7 (Kernel PCA) Principal Component Analysis (see Chapter 14 for details)
in a kernel feature space can be shown to correspond to the case of

$$c\big((x_i, y_i, f(x_i))_i\big) = \begin{cases} 0 & \text{if } \frac{1}{m} \sum_{i} \Big( f(x_i) - \frac{1}{m} \sum_{j} f(x_j) \Big)^2 = 1, \\ \infty & \text{otherwise,} \end{cases} \qquad (4.14)$$

with $\Omega(\cdot)$ an arbitrary function that is strictly monotonically increasing [480]. The constraint
ensures that we only consider linear feature extraction functionals that produce
outputs of unit empirical variance. In other words, the task is to find the simplest function
with unit variance. Note that in this case of unsupervised learning, there are no labels $y_i$
to consider.


4.3 Regularization Operators 


Curse of 
Dimensionality 


The RKHS framework proved useful in obtaining the explicit functional form of
minimizers of the regularized risk functional. It still does not explain the good performance
of kernel algorithms, however. In particular, it seems counter-intuitive
that estimators using very high dimensional feature spaces (easily with some $10^{10}$
features as in optical character recognition with polynomial kernels, or even infinite
dimensional spaces in the case of Gaussian RBF-kernels) should exhibit good
performance. It seems as if kernel methods are defying the curse of dimensionality
[29], which requires the number of samples to increase with the dimensionality of
the space in which estimation is performed. However, the distribution of capacity
in these spaces is not isotropic (cf. Section 2.2.5).

The basic idea of the viewpoint described in the present section is simple: rather 
than dealing with an abstract quantity such as an RKHS, which is defined by 
means of its corresponding kernel k, we take the converse approach of obtaining 
a kernel via the corresponding Hilbert space. Unless stated otherwise, we will use 
$L_2(\mathcal{X})$ as the Hilbert space (cf. Section B.3) on which the regularization operators
will be defined. Note that $L_2(\mathcal{X})$ is not the feature space $\mathcal{H}$.

Recall that in Section 2.2.2, we showed that one way to think of the kernel
mapping is as a map that takes a point $x \in \mathcal{X}$ to a function $k(x, \cdot)$ living in an
RKHS. To do this, we constructed a dot product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ satisfying

$$k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}. \qquad (4.15)$$

Physically, however, it is still unclear what the dot product $\langle f, g \rangle_{\mathcal{H}}$ actually does.
Does it compute some kind of “overlap” of the functions, similar to the usual
dot product between functions in $L_2(\mathcal{X})$? Recall that, assuming we can define an
integral on $\mathcal{X}$, the latter is (cf. (B.60))

$$\langle f, g \rangle_{L_2(\mathcal{X})} = \int_{\mathcal{X}} f g. \qquad (4.16)$$

In the present section, we will show that whilst our dot product in the RKHS is
not quite as simple as (4.16), we can at least write it as

$$\langle f, g \rangle_{\mathcal{H}} = \langle \Upsilon f, \Upsilon g \rangle_{L_2} = \int_{\mathcal{X}} (\Upsilon f)(x) (\Upsilon g)(x) \, dx \qquad (4.17)$$

in a suitable $L_2$ space of functions. This space contains transformed versions of the
original functions, where the transformation $\Upsilon$ “extracts” those parts that should
be affected by the regularization. This gives a much clearer physical understanding
of the dot product in the RKHS (and thus of the similarity measure used by
SVMs). It becomes particularly illuminating once one sees that for common kernels,
the associated transformation $\Upsilon$ extracts properties like derivatives of functions.
In other words, these kernels induce a form of regularization that penalizes
non-smooth functions.


Definition 4.8 (Regularization Operator) A regularization operator $\Upsilon$ is defined as a
linear map from the space of functions $\mathcal{F} := \{f \mid f : \mathcal{X} \to \mathbb{R}\}$ into a space equipped with a
dot product. The regularization term $\Omega[f]$ takes the form

$$\Omega[f] := \frac{1}{2} \langle \Upsilon f, \Upsilon f \rangle. \qquad (4.18)$$

Without loss of generality, we may assume that $\Upsilon$ is positive definite. This can be
seen as follows: all that matters for the definition of $\Omega[f]$ is the positive definite
operator $\Upsilon^* \Upsilon$ (since $\langle \Upsilon f, \Upsilon f \rangle = \langle f, \Upsilon^* \Upsilon f \rangle$). Hence we may always define a positive
definite operator $\Upsilon_h := (\Upsilon^* \Upsilon)^{1/2}$ (cf. Section B.2.2) which has the same regularization
properties as $\Upsilon$. Next, we formally state the equivalence between RKHS and
regularization operator view.


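Theorem 4.9 (RKHS and Regularization Operators) For every RKHS $\mathcal{H}$ with reproducing
kernel k there exists a corresponding regularization operator $\Upsilon : \mathcal{H} \to \mathcal{D}$ such that
for all $f \in \mathcal{H}$,

$$\langle \Upsilon k(x, \cdot), \Upsilon f(\cdot) \rangle_{\mathcal{D}} = f(x), \qquad (4.19)$$

and in particular,

$$\langle \Upsilon k(x, \cdot), \Upsilon k(x', \cdot) \rangle_{\mathcal{D}} = k(x, x'). \qquad (4.20)$$

Likewise, for every regularization operator $\Upsilon : \mathcal{F} \to \mathcal{D}$, where $\mathcal{F}$ is some function space
equipped with a dot product, there exists a corresponding RKHS $\mathcal{H}$ with reproducing
kernel k such that (4.19) and (4.20) are satisfied.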


Equation (4.20) will become the central tool to analyze smoothness properties of
kernels, in particular if we pick $\mathcal{D}$ to be $L_2(\mathcal{X})$. In this case we will obtain an explicit
form of the dot product induced by the RKHS which will thereby clarify why
kernel methods work.

From Section 2.2.4 we can see that minimization of $\|w\|^2$ is equivalent to minimization
of $\Omega[f]$ (4.18), due to the feature map $\Phi(x) := k(x, \cdot)$.


Proof We prove the first part by explicitly constructing an operator that takes
care of the mapping. One can see immediately that $\Upsilon = 1$ and $\mathcal{D} = \mathcal{H}$ will satisfy
all requirements.¹

For the converse statement, we have to obtain k from $\Upsilon^* \Upsilon$ and show that this is,
in fact, the kernel of an RKHS (note that this does not imply that $\mathcal{D} = \mathcal{H}$ since it
may be equipped with a different dot product than $\mathcal{H}$).

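A function $G_x(\cdot)$ satisfying the first equality in

$$f(x) = \langle \Upsilon^* \Upsilon G_x, f \rangle_{\mathcal{D}} = \langle \Upsilon G_x, \Upsilon f \rangle_{\mathcal{D}} \qquad (4.21)$$

for all $f \in \Upsilon^* \Upsilon \mathcal{F}$ is called Green’s function of the operator $\Upsilon^* \Upsilon$ on $\mathcal{D}$. It is known
that such functions exist [448]. Note that this amounts to our desired reproducing
property (4.19), on the set $\Upsilon^* \Upsilon \mathcal{F}$. The second equality in (4.21) follows from the
definition of the adjoint operator $\Upsilon^*$.

By applying (4.21) to $G_x$ it follows immediately that G is symmetric,

$$G_x(x') = \langle \Upsilon^* \Upsilon G_{x'}, G_x \rangle_{\mathcal{D}} = \langle \Upsilon G_{x'}, \Upsilon G_x \rangle_{\mathcal{D}} = \langle \Upsilon G_x, \Upsilon G_{x'} \rangle_{\mathcal{D}} = G_{x'}(x). \qquad (4.22)$$

We will write it as $G(x, x')$. Observe that (4.22) tells us that $x \mapsto \Upsilon G_x$ is
a valid feature map for G. Therefore, we may identify $G(x, x')$ with $k(x, x')$.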


The corresponding RKHS is the closure of the set $\{f \in \Upsilon^* \Upsilon \mathcal{F} \mid \|\Upsilon f\|^2 < \infty\}$. ∎

1. $\Upsilon = 1$ is not the most useful operator. Typically we will seek an operator $\Upsilon$ corresponding
to a specific dot product space $\mathcal{D}$. Note that this need not always be possible if $\mathcal{D}$ is not
suitably chosen, e.g., for $\mathcal{D} = \mathbb{R}$.


This means that $\mathcal{D}$ is an RKHS with inner product $\langle \Upsilon \cdot, \Upsilon \cdot \rangle_{\mathcal{D}}$. Furthermore, Theorem 4.9
means that fixing the regularization operator $\Upsilon$ determines the possible
set of functions that we might obtain, independently² of the class of functions
in which we expand the estimate f. Thus Support Vector Machines are simply a
very convenient way of specifying the regularization and a matching class of basis
functions via one kernel function. This is done mainly for algorithmic advantages
when formulating the corresponding optimization problem (cf. Chapter 7). The
case where the two do not match is discussed in detail in [512].

Given the eigenvector decomposition of a regularization operator we can define 
a class of kernels that satisfy the self consistency condition (4.20). 


Proposition 4.10 (A Discrete Counterpart) Given a regularization operator $\Upsilon$ with an
expansion of $\Upsilon^* \Upsilon$ into a discrete eigenvector decomposition $(\lambda_n, \psi_n)$, and a kernel k with

$$k(x, x') := \sum_{n} \frac{d_n}{\lambda_n} \psi_n(x) \psi_n(x'), \qquad (4.23)$$

where $d_n \in \{0, 1\}$ for all n, and $\sum_n \frac{d_n}{\lambda_n}$ convergent, then k satisfies (4.20). Moreover, the
corresponding RKHS is given by $\operatorname{span}\{\psi_i \mid d_i = 1 \text{ and } i \in \mathbb{N}\}$.


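Proof We evaluate (4.21) and use the orthonormality of the system $\left(\frac{d_n}{\lambda_n}, \psi_n\right)$.

$$\begin{aligned}
\langle k(x_i, \cdot), (\Upsilon^* \Upsilon k)(x_j, \cdot) \rangle
&= \Big\langle \sum_{n} \frac{d_n}{\lambda_n} \psi_n(x_i) \psi_n(\cdot), \; \Upsilon^* \Upsilon \sum_{n'} \frac{d_{n'}}{\lambda_{n'}} \psi_{n'}(x_j) \psi_{n'}(\cdot) \Big\rangle \\
&= \sum_{n, n'} \frac{d_n}{\lambda_n} \frac{d_{n'}}{\lambda_{n'}} \psi_n(x_i) \psi_{n'}(x_j) \langle \psi_n(\cdot), \Upsilon^* \Upsilon \psi_{n'}(\cdot) \rangle \\
&= \sum_{n} \frac{d_n}{\lambda_n} \psi_n(x_i) \psi_n(x_j) = k(x_i, x_j).
\end{aligned} \qquad (4.24)$$

The statement about the span follows immediately from the construction of k. ∎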


The summation coefficients are permitted to be rearranged, since the eigenfunctions
are orthonormal and the series $\sum_n \frac{d_n}{\lambda_n}$ converges absolutely. Consequently a
large class of kernels can be associated with a given regularization operator (and
vice versa), thereby restricting us to a subspace of the eigenvector decomposition
of $\Upsilon^* \Upsilon$.
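
A finite-dimensional analogue of Proposition 4.10 can be checked numerically. In the hedged sketch below, functions on a four-point domain are vectors, the operator $\Upsilon^* \Upsilon$ is a symmetric matrix built from an orthonormal eigenbasis, and the kernel matrix follows (4.23). The basis (scaled Hadamard vectors), the eigenvalues, and the selection flags $d_n$ are arbitrary choices made for illustration.

```python
# Orthonormal eigenvectors psi_n on a 4-point domain (scaled Hadamard vectors).
PSI = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
LAMBDA = [1.0, 2.0, 4.0, 8.0]  # assumed eigenvalues lambda_n of the operator
D = [1, 1, 0, 1]               # d_n in {0, 1}: the third eigenvector is excluded

def outer_sum(weights):
    """Return sum_n weights[n] * psi_n psi_n^T as a 4x4 matrix."""
    return [[sum(w * p[i] * p[j] for w, p in zip(weights, PSI))
             for j in range(4)] for i in range(4)]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

R = outer_sum(LAMBDA)                              # matrix of the operator
K = outer_sum([d / l for d, l in zip(D, LAMBDA)])  # kernel matrix as in (4.23)

# Entry (i, j) of K R K is <k(x_i, .), (operator) k(x_j, .)>.
KRK = matmul(matmul(K, R), K)
```

Because $d_n^2 = d_n$ and the eigenvectors are orthonormal, `KRK` agrees with `K` entry by entry, which is the self consistency condition (4.20) in this discrete setting.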

In other words, there exists a one to one correspondence between kernels and
regularization operators only on the image of $\mathcal{H}$ under the integral operator
$(T_k f)(x) := \int_{\mathcal{X}} k(x, x') f(x') \, dx'$, namely that $T_k$ and $\Upsilon^* \Upsilon$ are inverse to one another. On the
null space of $T_k$, however, the regularization operator $\Upsilon^* \Upsilon$ may take on an arbitrary
form. In this case k still will fulfill the self consistency condition.

2. Provided that no $f \in \mathcal{D}$ contains directions of the null space of the regularization operator
$\Upsilon^* \Upsilon$, and that the kernel functions k span the whole space $\mathcal{D}$. If this is not the case,
simply define the space to be the span of $k(x, \cdot)$.

Excluding eigenfunctions of $\Upsilon^* \Upsilon$ from the kernel expansion effectively decreases
the expressive power of the set of approximating functions, and limits the capacity 
of the system of functions. Removing low capacity (i.e. very flat) eigenfunctions 
from the expansion will have an adverse effect, though, as the data will then be 
approximated by the higher capacity functions. 

We have now covered the main insights of the present chapter. The following 
sections are more technical and can be skipped if desired. Recall that at the be- 
ginning of the present section, we explained that regularization operators can be 
thought of as extracting those parts of the functions that should be affected by the 
regularization. In the next section, we show that for a specific class of kernels, this 
extraction coincides with the Fourier transform. 


4.4 Translation Invariant Kernels 


An important class of kernels $k(x, x')$, such as Gaussian RBF kernels or Laplacian
kernels, depends only on the difference between x and x'. For the sake of simplicity
and with slight abuse of notation we will use the shorthand


k(x, x’) = k(x — x’) (4.25) 


or simply k(x). Since such k are independent of the absolute position of x but 
depend only on x — x’ instead, we will refer to them as translation invariant kernels. 

What we will show in the following is that for kernels defined via (4.25) there
exists a simple recipe for finding a regularization operator $\Upsilon^* \Upsilon$ corresponding to
k and vice versa. In particular, we will show that the Fourier transform of k(x) will
provide us with the representation of the regularization operator in the frequency
domain.


Fourier Transformation For this purpose we need a few definitions. For the sake
of simplicity we assume $\mathcal{X} \subset \mathbb{R}^N$. In this case the Fourier transformation of f is
given by

$$F[f](\omega) = (2\pi)^{-N/2} \int_{\mathcal{X}} f(x) \exp(-i \langle x, \omega \rangle) \, dx. \qquad (4.26)$$

Note that here i is the imaginary unit and that, in general, $F[f](\omega) \in \mathbb{C}$ is a complex
number. The inverse Fourier transformation is then given by

$$f(x) = F^{-1}[F[f]](x) = (2\pi)^{-N/2} \int F[f](\omega) \exp(i \langle x, \omega \rangle) \, d\omega. \qquad (4.27)$$


Regularization Operator in Fourier Domain We now specifically consider regularization
operators $\Upsilon$ that may be written as multiplications in Fourier space (i.e.
$\Upsilon^* \Upsilon$ is diagonalized in the Fourier basis).
Denote by $\upsilon(\omega)$ a nonnegative, symmetric function defined on $\mathcal{X}$, i.e. $\upsilon(-\omega) =
\upsilon(\omega) \geq 0$ which converges to 0 for $\|\omega\| \to \infty$. Moreover denote by $\Omega$ the support
of $\upsilon(\omega)$ and by $\bar{x}$ the complex conjugate of x. Now we introduce a regularization
operator by

$$\langle \Upsilon f, \Upsilon g \rangle = (2\pi)^{-N/2} \int_{\Omega} \frac{\overline{F[f](\omega)} F[g](\omega)}{\upsilon(\omega)} \, d\omega. \qquad (4.28)$$
The goal of regularized risk minimization in RKHS is to find a function which
minimizes $R_{\mathrm{reg}}[f]$ while keeping $\langle \Upsilon f, \Upsilon f \rangle$ reasonably small. In the context of
(4.28) this means the following:

Small nonzero values of $\upsilon(\omega)$ correspond to a strong attenuation of the corresponding
frequencies. Hence small values of $\upsilon(\omega)$ for large $\omega$ are desirable, since high
frequency components of $F[f]$ correspond to rapid changes in f. It follows that
$\upsilon(\omega)$ describes the filter properties of $\Upsilon^* \Upsilon$; note that no attenuation takes place
for $\upsilon(\omega) = 0$, since these frequencies have been excluded from the integration domain
$\Omega$.

Our next step is to construct kernels k corresponding to $\Upsilon$ as defined in (4.28).


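Green’s Functions and Fourier Transformations We show that

$$G(x, x') = (2\pi)^{-N/2} \int_{\Omega} e^{i \langle \omega, x - x' \rangle} \upsilon(\omega) \, d\omega \qquad (4.29)$$

is a Green’s function for $\Upsilon$ and $\mathcal{D}$, and that it can be used as a kernel. For a function f
whose Fourier transform has support contained in $\Omega$, we have

$$\begin{aligned}
\langle \Upsilon G(x, \cdot), \Upsilon f \rangle
&= (2\pi)^{-N/2} \int_{\Omega} \frac{\overline{F[G(x, \cdot)](\omega)} F[f](\omega)}{\upsilon(\omega)} \, d\omega && (4.30) \\
&= (2\pi)^{-N/2} \int_{\Omega} \frac{\upsilon(\omega) \exp(i \langle x, \omega \rangle) F[f](\omega)}{\upsilon(\omega)} \, d\omega && (4.31) \\
&= (2\pi)^{-N/2} \int_{\Omega} \exp(i \langle x, \omega \rangle) F[f](\omega) \, d\omega = f(x). && (4.32)
\end{aligned}$$

From Theorem 4.9 it now follows that G is a Green’s function and that it can be
used as an RKHS kernel.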

Eq. (4.29) provides us with an efficient tool for analyzing SV kernels and the types 
of capacity control they exhibit: we may also read (4.29) backwards and, in doing 
so, find the regularization operator for a given kernel, simply by applying the 
Fourier transform to k(x). As expected, kernels with high frequency components 
will lead to less smooth estimates. 

Note that (4.29) is a special case of Bochner’s theorem [60], which states that the 
Fourier transform of a positive measure constitutes a positive definite kernel. 


In the remainder of this section we will now apply our new insight to a wide
range of popular kernels such as $B_n$-splines, Gaussian kernels, Laplacian kernels,
and periodic kernels. A discussion of the multidimensional case which requires
additional mathematical techniques is left to Section 4.5.


4.4.1 $B_n$-Splines


As was briefly mentioned in Section 2.3, splines are an important tool in interpolation
and function estimation. They excel at problems of low dimensional interpolation.
Computational problems become increasingly acute, however, as the
dimensionality of the patterns (i.e. of x) increases; yet there exists a way to circumvent
these difficulties. In [501, 572], a method is proposed for using $B_n$-splines (see
Figure 4.1) as building blocks for kernels, i.e.,

$$k(x) = B_n(x). \qquad (4.33)$$

We start with $\mathcal{X} = \mathbb{R}$ (higher dimensional cases can also be obtained, for instance by
taking products over the individual dimensions). Recall that $B_n$ splines are defined
as n + 1 convolutions³ of the centered unit interval (cf. (2.71) and [552]);

$$B_n := \bigotimes_{i=1}^{n+1} 1_{[-0.5, 0.5]}. \qquad (4.34)$$


Given this kernel, we now use (4.29) in order to obtain the corresponding Fourier 
representation. In particular, we must compute the Fourier transform of $B_n(x)$. The
following theorem allows us to do this conveniently for functions represented by 
convolutions. 


Theorem 4.11 (Fourier-Plancherel, e.g. [306, 112]) Denote by f, g two functions in
$L_2(\mathcal{X})$, by $F[f]$, $F[g]$ their corresponding Fourier transforms, and by $\otimes$ the convolution
operation. Then the following identities hold.

$$F[f \otimes g] = F[f] \cdot F[g], \quad \text{and} \quad F[f] \otimes F[g] = F[f \cdot g]. \qquad (4.35)$$


In other words, convolutions in the original space become products in the Fourier 
domain and vice versa. Hence we may jump from one representation to the other 
depending on which space is most convenient for our calculations. 

Repeated application of Theorem 4.11 shows that in the case of $B_n$ splines, the
Fourier representation is conveniently given by the (n + 1)st power of the Fourier
transform of $B_0$. Since the Fourier transform of $B_n$ equals $\upsilon(\omega)$, we obtain (up to a
multiplicative constant)

$$\upsilon(\omega) = F[k](\omega) = \prod_{i=1}^{N} \operatorname{sinc}^{n+1}\left(\frac{\omega_i}{2}\right), \quad \text{where } \operatorname{sinc} x := \frac{\sin x}{x}. \qquad (4.36)$$


3. A convolution $f \otimes g$ of two functions $f, g : \mathcal{X} \to \mathbb{R}$ is defined as
$(f \otimes g)(x) := (2\pi)^{-N/2} \int_{\mathcal{X}} f(x') g(x - x') \, dx'$.
The normalization factor of $(2\pi)^{-N/2}$ serves to make the convolution compatible with the
Fourier transform. We will need this property in Theorem 4.11. Note that $f \otimes g = g \otimes f$, as
can be seen by exchange of variables.


Figure 4.1 From left to right: $B_n$ splines of order 0 to 3 (top row) and their Fourier
transforms (bottom row). The length of the support of $B_n$ is n + 1, and the degree of
continuous differentiability increases with n − 1. Note that the higher the degree of $B_n$, the
more peaked the Fourier transform (4.36) becomes. This is due to the increasing support of
$B_n$. The frequency axis labels of the Fourier transform are multiples of $2\pi$.


This illustrates why only $B_n$ splines of odd order are positive definite kernels (cf.
(2.71)):⁴ The even ones have negative components in the Fourier spectrum (which
would result in an amplification of the corresponding frequencies). The zeros in
$F[k]$ stem from the fact that $B_n$ has compact support, $\left[-\frac{n+1}{2}, \frac{n+1}{2}\right]$. See Figure 4.2
for details.
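
The sign pattern of the spectrum can be inspected directly. The following Python sketch evaluates the one-dimensional form of (4.36), $\operatorname{sinc}^{n+1}(\omega/2)$, on a grid and records the smallest value for each spline order. The grid range and spacing and the names `sinc`, `bspline_spectrum`, and `min_by_order` are illustrative assumptions.

```python
import math

def sinc(x):
    """sinc x = sin(x) / x, with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def bspline_spectrum(omega, n):
    """sinc^{n+1}(omega / 2): spectrum of B_n in one dimension, up to a constant."""
    return sinc(omega / 2.0) ** (n + 1)

# Evaluate on omega in [-40, 40] and record the minimum for orders 0 to 3.
grid = [0.01 * t for t in range(-4000, 4001)]
min_by_order = {n: min(bspline_spectrum(w, n) for w in grid) for n in range(4)}
```

For even n the exponent n + 1 is odd, so the negative lobes of the sinc function survive; for odd n the exponent is even and the spectrum stays nonnegative on the whole grid.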

By using this kernel, we trade reduced computational complexity in calculating
f (we need only take points into account whose distance $\|x_i - x_j\|$ is smaller
than the support of $B_n$), for a potentially decreased performance of the regularization
operator, since it completely removes (i.e., disregards) frequencies $\omega_p$ with
$F[k](\omega_p) = 0$. Moreover, as we shall see below, in comparison to other kernels, such
as the Gaussian kernel, $F[k](\omega)$ decays rather slowly.

4. Although both even and odd order $B_n$ splines converge to a Gaussian as $n \to \infty$ due to
the law of large numbers.

Figure 4.2 Left: $B_3$-spline kernel. Right: Fourier transform of k (in log-scale). Note the zeros
and the rapid decay in the spectrum of $B_3$.


4.4.2 Gaussian Kernels


Another class of kernels are Gaussian radial basis function kernels (Figure 4.3).
These are widely popular in Neural Networks and approximation theory [80, 203,
201, 420]. We have already encountered $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ in (2.68); we now
investigate the regularization and smoothness properties of these kernels.
For a Fourier representation we need only compute the Fourier transform of
(2.68), which is given by

$$F[k](\omega) = \upsilon(\omega) = |\sigma| \exp\left(-\frac{\sigma^2 \|\omega\|^2}{2}\right). \qquad (4.37)$$


In other words, the smoother k is in pattern space, the more peaked its Fourier 
transform becomes. In particular, the product between the width of k and its 
Fourier transform is constant.⁵ This phenomenon is also known as the uncertainty
relation in physics and engineering. 

Equation (4.37) also means that the contribution of high frequency components
in estimates is relatively small, since $\upsilon(\omega)$ decays extremely rapidly. It also helps
explain why Gaussian kernels produce full rank kernel matrices (Theorem 2.18).
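
The closed form (4.37) can be compared with a direct numerical evaluation of the Fourier integral in one dimension. The sketch below assumes the Gaussian kernel $\exp(-x^2 / (2\sigma^2))$ and the normalization $(2\pi)^{-1/2}$ of (4.26); the quadrature rule, the integration range, and all function names are illustrative choices.

```python
import math

def gaussian(x, sigma):
    """One-dimensional Gaussian kernel k(x) = exp(-x^2 / (2 sigma^2))."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

def fourier_transform_even(f, omega, half_width=20.0, steps=20000):
    """Approximate (2 pi)^{-1/2} * integral of f(x) cos(omega x) dx over
    [-half_width, half_width] with the trapezoid rule. For an even function f
    this equals its Fourier transform, since the sine part cancels."""
    h = 2.0 * half_width / steps
    total = 0.0
    for t in range(steps + 1):
        x = -half_width + t * h
        weight = 0.5 if t in (0, steps) else 1.0
        total += weight * f(x) * math.cos(omega * x)
    return total * h / math.sqrt(2.0 * math.pi)

def gaussian_spectrum(omega, sigma):
    """Closed form |sigma| exp(-sigma^2 omega^2 / 2) in one dimension."""
    return abs(sigma) * math.exp(-(sigma ** 2) * (omega ** 2) / 2.0)
```

Comparing `fourier_transform_even(lambda x: gaussian(x, 0.5), 3.0)` with `gaussian_spectrum(3.0, 0.5)` illustrates the reciprocal relation between widths: a narrower kernel in pattern space gives a broader spectrum.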

We next determine an explicit representation of $\|\Upsilon f\|^2$ in terms of differential
operators, rather than a pure Fourier space formalism. While this is not possible 
by using only “conventional” differential operators, we may achieve our goal by 
using pseudo-differential operators. 

Roughly speaking, a pseudo-differential operator differs from a differential op- 
erator in that it may contain an infinite sum of differential operators. The latter 
correspond to a Taylor expansion of the operator in the Fourier domain. There is 
an additional requirement that the arguments lie inside the radius of convergence, 
however. 

Following the exposition of Yuille and Grzywacz [612] one can see that

$$\|\Upsilon f\|^2 = \int_{\mathcal{X}} \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n! \, 2^n} \left( O^n f(x) \right)^2 dx, \qquad (4.38)$$

with $O^{2n} = \Delta^n$ and $O^{2n+1} = \nabla \Delta^n$, $\Delta$ being the Laplacian and $\nabla$ the Gradient operator,
is equivalent to a regularization with $\upsilon(\omega)$ as in (4.37). The key observation
in this context is that derivatives in $\mathcal{X}$ translate to multiplications in the frequency
domain and vice versa.⁶ Therefore a Taylor expansion of $\upsilon(\omega)$ in $\omega$ can be rewritten
as a Taylor expansion in $\mathcal{X}$ in terms of differential operators. See [612] and the
references therein for more detail.

5. The multidimensional case is completely analogous, since it can be decomposed into a
product of one-dimensional Gaussians. See also Section 4.5 for more details.

Figure 4.3 Left: Gaussian kernel with standard deviation 0.5. Right: Fourier transform of
the kernel.

On the practical side, training an SVM with Gaussian RBF kernels [482] corresponds
to minimizing the specific loss function with a regularization operator of
type (4.38). Recall that (4.38) causes all derivatives of f to be penalized, to obtain
a very smooth estimate. This also explains the good performance of SVMs in this
case, since it is by no means obvious that choosing a flat function in some high dimensional
space will correspond to a simple function in a low dimensional space
(see Section 4.4.3 for a counterexample).


4.4.3 Dirichlet Kernels 


Proposition 4.10 can also be used to generate practical kernels. In particular, [572]
introduced a class of kernels based on Fourier expansions by

$$k(x) := 2 \sum_{j=0}^{n} \cos jx - 1 = \frac{\sin\left((2n+1)\frac{x}{2}\right)}{\sin\frac{x}{2}}. \qquad (4.39)$$

As in Section 4.4.1, we consider $x \in \mathbb{R}$ to avoid tedious notation. By construction,
this kernel corresponds to $\upsilon(\omega) = \frac{1}{2} \sum_{i=-n}^{n} \delta_i(\omega)$, with $\delta_i$ being Dirac’s delta function.

A regularization operator with these properties may not be desirable, however,
as it only damps a finite number of frequencies (see Figure 4.4), and leaves all other
frequencies unchanged, which can lead to overfitting (Figure 4.5).
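
The Dirichlet kernel can be evaluated either as a finite cosine sum or through its closed form, using the standard identity $2 \sum_{j=0}^{n} \cos jx - 1 = \sin((2n+1)x/2) / \sin(x/2)$. The short Python sketch below implements both so that they can be compared; the function names and the tolerance used to detect the removable singularity at multiples of $2\pi$ are assumptions made for illustration.

```python
import math

def dirichlet_sum(x, n):
    """Finite cosine sum 2 * sum_{j=0}^{n} cos(j x) - 1."""
    return 2.0 * sum(math.cos(j * x) for j in range(n + 1)) - 1.0

def dirichlet_closed(x, n):
    """Closed form sin((2n + 1) x / 2) / sin(x / 2).
    At multiples of 2 pi the quotient is replaced by its limit 2n + 1."""
    s = math.sin(x / 2.0)
    if abs(s) < 1e-12:
        return 2.0 * n + 1.0
    return math.sin((2 * n + 1) * x / 2.0) / s
```

All 2n + 1 frequencies enter the sum with the same weight, so nothing in the kernel favors low frequencies over high ones within the band.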


6. Integrability considerations aside, one can see this by

$$\frac{d}{dx} \int F[f](\omega) \exp(i \omega x) \, d\omega = \int i \omega F[f](\omega) \exp(i \omega x) \, d\omega.$$

Figure 4.4 Left: Dirichlet kernel of order 10. Note that this kernel is periodic. Right: Fourier
transform of the kernel.

Figure 4.5 Left: Regression with a Dirichlet Kernel of order N = 10. One can clearly
observe the overfitting (solid line: interpolation, ‘+’: original data). Right: Regression based
on the same data with a Gaussian Kernel of width $\sigma^2 = 1$ (dash dotted line: interpolation,
‘+’: original data).


In other words, this kernel only describes band-limited periodic functions where 
no distinction is made between the different components of the frequency spec- 
trum. Section 4.4.4 will present an example of a periodic kernel with a more suit- 
able distribution of capacity over the frequency spectrum. 

In some cases, it might be useful to approximate periodic functions, for instance 
functions defined on a circle. This leads to the second possible type of translation 
invariant kernel function, namely functions defined on factor spaces⁷. It is not
reasonable to define translation invariant kernels on a bounded interval, since the 
data will lie beyond the boundaries of the specified interval when translated by 
a large amount. Therefore unbounded intervals and factor spaces are the only 
possible domains. 


7. Factor spaces are vector spaces $\mathcal{X}$, with the additional property that for at least one
nonzero element $\hat{x} \in \mathcal{X}$, we have $x + \hat{x} = x$ for all $x \in \mathcal{X}$. For instance, the modulo operation
on $\mathbb{Z}$ forms such a space. We denote this space by $\mathcal{X} / \hat{x}$.

We assume a period of $2\pi$ without loss of generality, and thus consider translation
invariance on $\mathbb{R} / 2\pi$. The next section shows how this setting affects the
operator defined in Section 4.4.2.


4.4.4 Periodic Kernels 


One way of dealing with periodic invariances is to begin with a translation invariant
regularization operator, defined similarly to (4.38), albeit on $L_2([0, 2\pi])$ (where
the points 0 and $2\pi$ are identified) rather than on $L_2(\mathbb{R})$, and to find a matching
kernel function. We start with the regularization operator;

$$\|\Upsilon f\|^2 := \int_{[0, 2\pi]} \sum_{n} \frac{\sigma^{2n}}{n! \, 2^n} \left( O^n f(x) \right)^2 dx \qquad (4.40)$$


with O defined as in Section 4.4.2. For the sake of simplicity, assume $\dim \mathcal{X} = 1$. A
generalization to multidimensional kernels is straightforward.

To obtain the eigensystem of $\Upsilon$ we start with the Fourier basis, which is dense
on $L_2([0, 2\pi])$ [69], the space of functions we are interested in. One can check that
the Fourier basis $\{\frac{1}{2\pi}, \sin(nx), \cos(nx), n \in \mathbb{N}\}$ is an eigenvector decomposition of
the operator defined in (4.40), with eigenvalues $\exp\left(\frac{n^2 \sigma^2}{2}\right)$, by substitution into
(4.40). Due to the Fourier basis being dense in $L_2([0, 2\pi])$, we have thus identified
all eigenfunctions of $\Upsilon$. Next we apply Proposition 4.10, taking into account all
eigenfunctions except the constant function with n = 0. This yields the following
kernel,

$$\begin{aligned}
k(x, x') &= \sum_{n=1}^{\infty} e^{-\frac{n^2 \sigma^2}{2}} \left( \sin(nx) \sin(nx') + \cos(nx) \cos(nx') \right) \\
&= \sum_{n=1}^{\infty} e^{-\frac{n^2 \sigma^2}{2}} \cos(n(x - x')).
\end{aligned} \qquad (4.41)$$


For practical purposes, one may truncate the expansion after a finite number of 
terms. Since the expansion coefficients decay rapidly, this approximation is very 
good. If necessary, k can be rescaled to have a range of exactly [0, 1]. 
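
A truncated version of the periodic kernel (4.41) takes only a few lines. The sketch below is one possible implementation, assuming the series has coefficients $e^{-n^2 \sigma^2 / 2}$ and starts at n = 1; the function name and the default number of terms are hypothetical.

```python
import math

def periodic_gaussian_kernel(x, z, sigma, terms=50):
    """Truncation of the series sum_{n >= 1} exp(-n^2 sigma^2 / 2) cos(n (x - z))."""
    return sum(math.exp(-((n * sigma) ** 2) / 2.0) * math.cos(n * (x - z))
               for n in range(1, terms + 1))
```

Since the coefficients decay like a Gaussian in n, a few dozen terms already reproduce the full series to machine precision for moderate values of $\sigma$; smaller $\sigma$ needs more terms.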

While this is a convenient way of building kernels if the Fourier expansion is
known, we would also like to be able to render arbitrary translation invariant
kernels on $\mathbb{R}$ periodic. The method is rather straightforward, and works as follows.
Given any translation invariant kernel k we obtain $k_p$ by

$$k_p(x, x') := \sum_{n \in \mathbb{Z}} k(x - x' + 2\pi n). \qquad (4.42)$$

Again, we can approximate (4.42) by truncating the sum after a finite number
of terms. The question is whether the definition of $k_p$ leads to a positive definite
kernel at all, and if so, which regularization properties it exhibits.


Proposition 4.12 (Spectrum of Periodized Kernels) Denote by k a translation invariant
kernel in $L_2(\mathcal{X})$, and by $k_p$ its periodization according to (4.42). Moreover denote
by $F[f]$ the Fourier transform of f. Then $k_p$ can be expanded into the series

$$k_p(x, x') = (2\pi)^{-1/2} \left( F[k](0) + 2 \sum_{j=1}^{\infty} F[k](j) \cos(j(x - x')) \right). \qquad (4.43)$$

Figure 4.6 Left: Periodic Gaussian kernel for several values of $\sigma$ (normalized to 1 as its
maximum and 0 as its minimum value). Peaked functions correspond to small $\sigma$. Right:
Fourier coefficients of the kernel for $\sigma^2 = 0.1$.


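Proof The proof makes use of the fact that for Lebesgue integrable functions k the
integral over $\mathbb{R}$ can be split up into a sum over segments of size $2\pi$. Specifically,
for integer frequencies $\omega \in \mathbb{Z}$ (for which $e^{-i\omega 2\pi j} = 1$), we obtain

\begin{align}
(2\pi)^{-\frac{1}{2}} \int_{\mathbb{R}} k(x) e^{-i\omega x} dx
&= (2\pi)^{-\frac{1}{2}} \sum_{j \in \mathbb{Z}} \int_{[2\pi j, 2\pi(j+1)]} k(x) e^{-i\omega x} dx \tag{4.44} \\
&= (2\pi)^{-\frac{1}{2}} \int_{[0, 2\pi]} e^{-i\omega x} \sum_{j \in \mathbb{Z}} k(x + 2\pi j) dx \tag{4.45} \\
&= (2\pi)^{-\frac{1}{2}} \int_{[0, 2\pi]} e^{-i\omega x} k_p(x) dx. \tag{4.46}
\end{align}

The latter, however, is the Fourier transform of $k_p$ over the interval $[0, 2\pi]$. Hence
we have $F[k](j) = F[k_p](j)$ for $j \in \mathbb{Z}$, where $F[k_p](j)$ denotes the Fourier transform
over the compact set $[0, 2\pi]$.

Now we may use the inverse Fourier transformation on $[0, 2\pi]$ to obtain a
decomposition of $k_p$ into a trigonometric series. Due to the symmetry of k, the
imaginary part of F[k] vanishes, and thus all contributions of sin jx cancel out.
Moreover, we obtain (4.43) since cos x is a symmetric function. ∎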
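
As a numerical sanity check of Proposition 4.12, the following sketch compares a truncation of the periodization (4.42) with a truncation of the series (4.43) for a Gaussian kernel. It assumes the one-dimensional kernel k(x) = exp(-x²/(2σ²)), whose Fourier transform under the (2π)^{-1/2} convention is σ exp(-σ²ω²/2); the function names and truncation limits are illustrative choices only.

```python
import math

def gaussian(x, sigma):
    """Translation invariant Gaussian kernel k(x) = exp(-x^2 / (2 sigma^2))."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

def gaussian_ft(omega, sigma):
    """F[k](omega) with the (2 pi)^(-1/2) convention: sigma * exp(-sigma^2 omega^2 / 2)."""
    return sigma * math.exp(-0.5 * (sigma * omega) ** 2)

def periodized_direct(d, sigma, n_max=20):
    """Truncation of (4.42): sum over n = -n_max..n_max of k(d + 2 pi n)."""
    return sum(gaussian(d + 2.0 * math.pi * n, sigma) for n in range(-n_max, n_max + 1))

def periodized_series(d, sigma, j_max=60):
    """Truncation of (4.43): (2 pi)^(-1/2) (F[k](0) + 2 sum_j F[k](j) cos(j d))."""
    tail = sum(gaussian_ft(j, sigma) * math.cos(j * d) for j in range(1, j_max + 1))
    return (gaussian_ft(0, sigma) + 2.0 * tail) / math.sqrt(2.0 * math.pi)
```

For moderate σ both truncations agree to rounding error; which of the two converges faster depends on σ, since a narrow kernel has a broad spectrum and vice versa.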


In some cases, the full summation of $k_p$ can be computed in closed form. See
Problem 4.10 for an application of this reasoning to Laplacian kernels.

In the context of periodic functions, the difference between this kernel and the
Dirichlet kernel of Section 4.4.3 is that the latter does not distinguish between the
different frequency components in $\omega \in \{-n\pi, \ldots, n\pi\}$.

4.4.5 Practical Implications


We are now able to draw some useful conclusions regarding the practical applica- 
tion of translation invariant kernels. Let us begin with two extreme situations. 


■ Suppose that the shape of the power spectrum $\mathrm{Pow}[f](\omega)$ of the function we
would like to estimate is known beforehand. In this case, we should choose k such
that F[k] matches the expected value of the power spectrum of f. The latter is given
by the squared absolute value of the Fourier transformation of f, i.e.,

$$\mathrm{Pow}[f](\omega) := |F[f](\omega)|^2. \tag{4.47}$$

One may check, using the Fourier-Plancherel equality (Theorem 4.11), that Pow[f]
equals the Fourier transformation of the autocorrelation function of f, given by
$f(x) \otimes f(-x)$. In signal processing this is commonly known as the problem of
“matched filters” [581]. It has been shown that the optimal filter for the reconstruction
of signals corrupted with white noise has to match the frequency distribution
of the signal which is to be reconstructed. (White noise has a uniform distribution
over the frequency band occupied by the useful signal.)


■ If we know very little about the given data, however, it is reasonable to make
a general smoothness assumption. Thus a Gaussian kernel as in Section 4.4.2 or
4.4.4 is recommended. If computing time is important, we might instead consider
kernels with compact support, such as the $B_n$-spline kernels of Section 4.4.1. This
choice will cause many matrix elements $k_{ij} = k(x_i - x_j)$ to vanish.
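
The sparsity effect of compactly supported kernels can be illustrated with a small sketch. We assume here that the $B_1$-spline is the triangular function max(0, 1 - |x|), i.e. the convolution of the indicator function of [-0.5, 0.5] with itself; the grid of inputs and all function names are hypothetical.

```python
def b1_spline(x):
    """Triangular B_1-spline (assumed form): max(0, 1 - |x|), with support [-1, 1]."""
    return max(0.0, 1.0 - abs(x))

def kernel_matrix(points, kernel):
    """Dense kernel matrix K_ij = kernel(x_i - x_j)."""
    return [[kernel(xi - xj) for xj in points] for xi in points]

def fraction_of_zeros(matrix):
    """Fraction of matrix entries that are exactly zero."""
    entries = [v for row in matrix for v in row]
    return sum(1 for v in entries if v == 0.0) / len(entries)

points = [0.25 * i for i in range(40)]            # hypothetical 1-D inputs on a grid
sparsity = fraction_of_zeros(kernel_matrix(points, b1_spline))
```

On this grid only pairs with |i - j| ≤ 3 lie within the support, so most entries vanish and a sparse matrix format could be used in place of the dense list of lists.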


The usual scenario will be in between these two extremes, and we will have some 
limited prior knowledge available, which should be used in the choice of kernel. 
The goal of the present reasoning is to give a guide to selection of kernels through 
a deeper understanding of the regularization properties. For more information on 
using prior knowledge for choosing kernels, e.g. by explicit construction of kernels 
exhibiting only a limited amount of interaction, see Chapter 13. 

Finally, note that the choice of the kernel width may be more important than 
the actual functional form of the kernel. For instance, there may be little difference 
in the relevant filter properties close to $\omega = 0$ between a B-spline and a Gaussian
kernel (cf. Figure 4.7). This heuristic holds if we are interested only in uniform 
convergence results of a certain degree of precision, in which case only a small 
part of the power spectrum of k is relevant (see [604, 606] and also Section 12.4.1). 


4.5 Translation Invariant Kernels in Higher Dimensions 


Product Kernels 


Things get more complicated in higher dimensions. There are basically two ways
to construct kernels in $\mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ with N > 1, if no particular assumptions on
the data are made. First, we could construct kernels $k : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ by

$$k(\mathbf{x} - \mathbf{x}') = k(x_1 - x'_1) \cdot \ldots \cdot k(x_N - x'_N). \tag{4.48}$$

Figure 4.8 Laplacian product kernel in $\mathbb{R}$ and $\mathbb{R}^2$. Note the preferred directions in the two
dimensional case.

Note that we have deviated from our usual notation in that in the present section,
we use bold face letters to denote elements of the input space. This will help to
simplify the notation, using $\mathbf{x} = (x_1, \ldots, x_N)$ and $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_N)$ below.

The choice (4.48) usually leads to preferred directions in input space (see Figure 4.8),
since the kernels are not generally rotation invariant, the exception being
Gaussian kernels. This can also be seen from the corresponding regularization
operator. Since k factorizes, we can apply the Fourier transform to k on a
per-dimension basis, to obtain

$$F[k](\boldsymbol{\omega}) = F[k](\omega_1) \cdot \ldots \cdot F[k](\omega_N). \tag{4.49}$$
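
The preferred directions of product kernels are easy to observe numerically. The sketch below evaluates a Laplacian and a Gaussian product kernel at two points of equal Euclidean norm; the one-dimensional kernels exp(-|d|) and exp(-d²/2) and all names are illustrative assumptions.

```python
import math

def product_kernel(diff, k1d):
    """Product kernel (4.48): multiply a one-dimensional kernel over the coordinates of x - x'."""
    value = 1.0
    for d in diff:
        value *= k1d(d)
    return value

def laplacian_1d(d):
    return math.exp(-abs(d))

def gaussian_1d(d):
    return math.exp(-0.5 * d * d)

axis = (1.0, 0.0)                                # unit vector along a coordinate axis
diagonal = (math.sqrt(0.5), math.sqrt(0.5))      # unit vector along the diagonal
```

The Laplacian product kernel depends on the $\ell_1$ norm of x - x' and therefore takes different values at `axis` and `diagonal`, whereas the Gaussian product kernel depends only on the $\ell_2$ norm and takes the same value at both.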


The second approach is to assume $k(\mathbf{x} - \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|_{\ell_2})$. This leads to kernels
which are both translation invariant and rotation invariant. It is quite straightforward
to generalize the exposition to the rotation asymmetric case, and norms other
than the $\ell_2$ norm. We now recall some basic results which will be useful later.



4.5.1 Basic Tools 


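The N-dimensional Fourier transform is defined as

$$F : L_2(\mathbb{R}^N) \to L_2(\mathbb{R}^N) \text{ with } F[f](\boldsymbol{\omega}) := \frac{1}{(2\pi)^{N/2}} \int_{\mathbb{R}^N} e^{-i\langle \boldsymbol{\omega}, \mathbf{x} \rangle} f(\mathbf{x}) \, d\mathbf{x}. \tag{4.50}$$

Its inverse transform is given by

$$F^{-1} : L_2(\mathbb{R}^N) \to L_2(\mathbb{R}^N) \text{ with } F^{-1}[f](\mathbf{x}) := \frac{1}{(2\pi)^{N/2}} \int_{\mathbb{R}^N} e^{i\langle \boldsymbol{\omega}, \mathbf{x} \rangle} f(\boldsymbol{\omega}) \, d\boldsymbol{\omega}. \tag{4.51}$$

For radially symmetric functions, i.e. $f(\mathbf{x}) = f(\|\mathbf{x}\|)$, we can explicitly carry out
the integration on the sphere to obtain a Fourier transform which is also radially
symmetric (cf. [520, 373]):

$$F[f](\|\boldsymbol{\omega}\|) = \omega^{-\nu} H_\nu[r^\nu f(r)](\omega), \tag{4.52}$$

where $\nu := \frac{1}{2}d - 1$ (with $d = N$ the dimension of the input space), and $H_\nu$ is the Hankel
transform over the positive real line (we use the shorthand $\omega = \|\boldsymbol{\omega}\|$). The latter is defined as

$$H_\nu[f](\omega) := \int_0^\infty r f(r) J_\nu(\omega r) \, dr. \tag{4.53}$$

Here $J_\nu$ is the Bessel function of the first kind, which is given by

$$J_\nu(r) := r^\nu 2^{-\nu} \sum_{j=0}^{\infty} \frac{(-1)^j r^{2j}}{2^{2j} \, j! \, \Gamma(j + \nu + 1)}, \tag{4.54}$$

and $\Gamma(x)$ is the Gamma function, satisfying $\Gamma(n + 1) = n!$ for $n \in \mathbb{N}$.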
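
The series (4.54) can be evaluated directly for moderate arguments. The following sketch truncates it after a fixed number of terms (60 is an arbitrary choice of ours, adequate only while r is not large) and is meant as an illustration, not as a numerically robust Bessel routine.

```python
import math

def bessel_j(nu, r, n_terms=60):
    """Bessel function of the first kind via the truncated power series (4.54)."""
    total = 0.0
    for j in range(n_terms):
        denominator = 2 ** (2 * j) * math.factorial(j) * math.gamma(j + nu + 1)
        total += (-1) ** j * r ** (2 * j) / denominator
    return r ** nu * 2.0 ** (-nu) * total
```

Two known facts give quick checks: for ν = 1/2 the function has the closed form sqrt(2/(πr)) sin(r), and the first positive zero of $J_0$ lies near r = 2.404825557695773.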

Note that $H_\nu = H_\nu^{-1}$, i.e. $f = H_\nu[H_\nu[f]]$ (in $L_2$) due to the Hankel inversion
theorem [520] (see also Problem 4.11), which is just another way of writing the
inverse Fourier transform in the rotation symmetric case. Based on the results
above, we can now use (4.29) to compute the Green’s functions in $\mathbb{R}^N$ directly
from the regularization operators given in Fourier space.


4.5.2 Regularization Properties of Kernels in $\mathbb{R}^N$


We now give some examples of kernels typically used in SVMs, this time in $\mathbb{R}^N$.
We must first compute the Fourier/Hankel transform of the kernels.


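Example 4.13 (Gaussian RBFs) For Gaussian RBFs in N dimensions, $k(r) = \sigma^{-N} e^{-\frac{r^2}{2\sigma^2}}$,
and correspondingly (as before we use the shorthand $\omega := \|\boldsymbol{\omega}\|$),

$$F[k](\omega) = \omega^{-\nu} \sigma^{-N} H_\nu\left[r^\nu e^{-\frac{r^2}{2\sigma^2}}\right](\omega) = \omega^{-\nu} \sigma^{2(\nu+1)-N} \omega^{\nu} e^{-\frac{\omega^2\sigma^2}{2}} = e^{-\frac{\omega^2\sigma^2}{2}}.$$

In other words, the Fourier transform of a Gaussian is also a Gaussian, in higher dimensions.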
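Example 4.14 (Exponential RBFs) In the case of $k(r) = e^{-ar}$,

\begin{align}
F[k](\omega) &= \omega^{-\nu} H_\nu\left[r^\nu e^{-ar}\right](\omega) \tag{4.55} \\
&= \omega^{-\nu} \, 2^{\nu+1} a \, \omega^{\nu} \pi^{-\frac{1}{2}} \, \Gamma\left(\nu + \tfrac{3}{2}\right) \left(a^2 + \omega^2\right)^{-(\nu + \frac{3}{2})} \\
&= 2^{\frac{N}{2}} a \, \pi^{-\frac{1}{2}} \, \Gamma\left(\tfrac{N}{2} + \tfrac{1}{2}\right) \left(a^2 + \omega^2\right)^{-\frac{N+1}{2}}.
\end{align}

For N = 1 we recover the damped harmonic oscillator in the frequency domain. In general,
a decay in the Fourier spectrum approximately proportional to $\omega^{-(N+1)}$ can be observed.
Moreover the Fourier transform of k, viewed itself as a kernel, $k(r) = (1 + r^2)^{-\frac{N+1}{2}}$, yields
the initial kernel as its corresponding Fourier transform.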
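
The closed forms of Examples 4.13 and 4.14 can be checked numerically in the special case N = 3, where ν = 1/2 and $J_{1/2}(x) = \sqrt{2/(\pi x)} \sin x$, so that (4.52) and (4.53) reduce to $\sqrt{2/\pi} \, \omega^{-1} \int_0^\infty r k(r) \sin(\omega r) \, dr$. The sketch below approximates this integral by Simpson's rule on a truncated interval; the cutoff `r_max`, the step count, and the function name are assumptions made for illustration.

```python
import math

def radial_fourier_3d(k, omega, r_max=60.0, n_steps=60000):
    """Radial Fourier transform (4.52) for N = 3 (nu = 1/2), computed as
    sqrt(2/pi) / omega * integral_0^{r_max} r k(r) sin(omega r) dr with Simpson's rule.
    n_steps must be even; r_max must be large enough for k to have decayed."""
    h = r_max / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        r = i * h
        weight = 1.0 if i in (0, n_steps) else (4.0 if i % 2 else 2.0)
        total += weight * r * k(r) * math.sin(omega * r)
    return math.sqrt(2.0 / math.pi) / omega * total * h / 3.0
```

For the Gaussian RBF with σ = 0.8 the result should match exp(-ω²σ²/2), and for the exponential RBF with a = 1.3 it should match $2^{3/2} a \, \pi^{-1/2} (a^2 + \omega^2)^{-2}$, which is the last line of Example 4.14 with N = 3 and Γ(2) = 1.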


Example 4.15 (Damped Harmonic Oscillator) Another way to generalize the harmonic
oscillator, this time so that k does not depend on the dimensionality N, is to set
$k(r) = \frac{1}{a^2 + r^2}$. Following [586, Section 13.6],

$$F[k](\omega) = \omega^{-\nu} H_\nu\left[\frac{r^\nu}{a^2 + r^2}\right](\omega) = \omega^{-\nu} a^{\nu} K_\nu(\omega a), \tag{4.56}$$

where $K_\nu$ is the Bessel function of the second kind, defined by (see [520])

$$K_\nu(x) = \int_0^\infty e^{-x \cosh t} \cosh(\nu t) \, dt. \tag{4.57}$$

It is possible to upper bound F[k] using

$$K_\nu(x) = \sqrt{\frac{\pi}{2x}} \, e^{-x} \left[ \sum_{j=0}^{p-1} (2x)^{-j} \frac{\Gamma(\nu + j + \frac{1}{2})}{j! \, \Gamma(\nu - j + \frac{1}{2})} + \theta \, (2x)^{-p} \frac{\Gamma(\nu + p + \frac{1}{2})}{p! \, \Gamma(\nu - p + \frac{1}{2})} \right], \tag{4.58}$$

with $p > \nu - \frac{1}{2}$ and $\theta \in [0, 1]$ [209, eq. (8.451.6)]. The term in brackets [·] converges to 1
as $x \to \infty$, and thus results in an exponential decay of the Fourier spectrum.


Example 4.16 (Modified Bessel Kernels) In the previous example, we defined a kernel
via $k(r) = \frac{1}{a^2 + r^2}$. Since k(r) is a nonnegative function with acceptable decay properties,
we could also use this function to define a kernel in Fourier space via $\upsilon(\omega) = \frac{1}{a^2 + \omega^2}$.
The consequence thereof is that (4.56) will now be a kernel, i.e.,

$$k(r) := r^{-\nu} a^{\nu} K_\nu(ra). \tag{4.59}$$

This is a popular kernel in Gaussian Process estimation [599] (see Section 16.3), since
for ν > n the corresponding Gaussian process is a mean-square differentiable stochastic
process. See [3] for more detail on this subject. For our purposes, it is sufficient to know
that for ν > n, $k(\|\mathbf{x} - \mathbf{x}'\|)$ is differentiable in $\mathbb{R}^N$.


Example 4.17 (Generalized $B_n$ Splines) Finally, we generalize $B_n$-splines to N dimensions.
One way is to define

$$B_n^N := \bigotimes_{j=0}^{n} 1_{U_N}, \tag{4.60}$$

so that $B_n^N$ is the (n + 1)-times convolution of the indicator function of the unit ball $U_N$
in N dimensions. See Figure 4.9 for examples of such functions. Employing the Fourier-
Plancherel Theorem (Theorem 4.11), we find that its Fourier transform is the (n + 1)st
power of the Fourier transform of the unit ball,

$$F[B_0^N](\omega) = \omega^{-(\nu+1)} J_{\nu+1}(\omega), \tag{4.61}$$

and therefore,

$$F[B_n^N](\omega) = \omega^{-(n+1)(\nu+1)} J_{\nu+1}^{n+1}(\omega). \tag{4.62}$$

Only odd n generate positive definite kernels, since it is only then that the kernel has a
nonnegative Fourier transform.


Figure 4.9 $B_n$ splines in 2 dimensions. From left to right and top to bottom: Splines of order
0 to 3. Again, note the increasing degree of smoothness and differentiability with increasing
order of the splines.


4.5.3 A Note on Other Invariances 


So far we have only been exploiting invariances with respect to the translation
group in $\mathbb{R}^N$. The methods could also be applied to other symmetry transformations
with corresponding canonical coordinate systems, however. This means that
we use a coordinate system where invariance transformations can be represented
as additions.

Not all symmetries have this property. Informally speaking, those that do
are called Lie groups (see also Section 11.3), and the parameter space where the
additions take place is referred to as a Lie algebra. For instance, the rotation and
scaling group (i.e. the product between the special orthogonal group SO(N) and
radial scaling), as proposed in [487, 167], corresponds to a log-polar parametrization
of $\mathbb{R}^N$. The matching transform into frequency space is commonly referred to
as the Fourier-Mellin transform [520].


4.6 Dot Product Kernels 


A second, important family of kernels can be efficiently described in terms of dot
products, i.e.,

$$k(x, x') = k(\langle x, x' \rangle). \tag{4.63}$$

Here, with slight abuse of notation we use k to define dot product kernels via
$k(\langle x, x' \rangle)$. Such dot product kernels include homogeneous and inhomogeneous
polynomial kernels $(\langle x, x' \rangle + c)^p$ with $c \geq 0$. Proposition 2.1 shows that they satisfy
Mercer’s condition.

In the following we state an easily verifiable criterion under which a general
kernel, as defined by (4.63), satisfies Mercer’s condition. A side-effect of this
analysis will be a deeper insight into the regularization properties of the operator
$\Upsilon^*\Upsilon$, when considered on the space $L_2(S_{N-1})$, where $S_{N-1}$ is the unit sphere
in $\mathbb{R}^N$. The choice of the domain $S_{N-1}$ is made in order to exploit the symmetries
inherent in k: $k(x, x')$ is rotation invariant in its arguments x, x'.

In a nutshell, we use Mercer’s Theorem (Theorem 2.10) explicitly to obtain an
expansion of k in terms of the eigenfunctions of the integral operator $T_k$ (2.38)
corresponding to k. For convenience, we briefly review the connection between
$T_k$, the eigenvalues $\lambda_j$, and kernels k.

For a given kernel k, the integral operator $(T_k f)(x) := \int_X k(x, x') f(x') \, d\mu(x')$ can
be expanded into its eigenvector decomposition $(\lambda_j, \psi_j(x))$, such that

$$k(x, x') = \sum_j \lambda_j \psi_j(x) \psi_j(x') \tag{4.64}$$

holds. Furthermore, the eigensystem of the regularization operator $\Upsilon^*\Upsilon$ is given
by $(\lambda_j^{-1}, \psi_j(x))$. The latter tells us the preference of a kernel expansion for specific
types of functions (namely the eigenfunctions $\psi_j$), and the smoothness assumptions
made via the size of the eigenvalues $\lambda_j$: for instance, large values of $\lambda_j$ correspond
to functions that are weakly penalized.
4.6.1 Conditions for Positivity and Eigenvector Decompositions


In the following we assume that X is the unit sphere $S_{N-1} \subset \mathbb{R}^N$ and that $\mu$ is the
uniform measure on $S_{N-1}$. This takes advantage of the inherent invariances in dot
product kernels and it simplifies the notation. We begin with a few definitions.


Legendre Polynomials Denote by $P_n(\xi)$ the Legendre Polynomials of degree n
and by $P_n^N(\xi)$ the associated Legendre Polynomials (see [373] for more details and
examples), where $P_n = P_n^3$. Without stating the explicit functional form we list
some properties we will need:


1. The (associated) Legendre Polynomials form an orthogonal basis with

$$\int_{-1}^{1} P_n^N(\xi) P_m^N(\xi) (1 - \xi^2)^{\frac{N-3}{2}} d\xi = \frac{|S_{N-1}|}{|S_{N-2}|} \frac{1}{M(N, n)} \delta_{m,n}. \tag{4.65}$$

Here $|S_{N-1}| = \frac{2\pi^{N/2}}{\Gamma(N/2)}$ denotes the surface of $S_{N-1}$, and M(N, n) denotes the
multiplicity of spherical harmonics of order n on $S_{N-1}$, which is given by
$M(N, n) = \frac{2n + N - 2}{n} \binom{n + N - 3}{n - 1}$.

2. We can find an expansion for any analytic function $k(\xi)$ on [−1, 1] into
the basis functions $P_n^N$, by

$$k(\xi) = \sum_{n=0}^{\infty} M(N, n) \frac{|S_{N-2}|}{|S_{N-1}|} P_n^N(\xi) \int_{-1}^{1} k(\xi) P_n^N(\xi) (1 - \xi^2)^{\frac{N-3}{2}} d\xi. \tag{4.66}$$

3. The Legendre Polynomials may be expanded into an orthonormal basis of
spherical harmonics $Y_{n,j}^N$ by the Funk-Hecke equation (see [373]), to obtain

$$P_n^N(\langle x, x' \rangle) = \frac{|S_{N-1}|}{M(N, n)} \sum_{j=1}^{M(N,n)} Y_{n,j}^N(x) Y_{n,j}^N(x'). \tag{4.67}$$

The explicit functional form of $Y_{n,j}^N$ is not important for the further analysis.


Necessary and Sufficient Conditions Below we list conditions, as proven by
Schoenberg [466], under which a function $k(\langle x, x' \rangle)$, defined on $S_{N-1}$, is positive
definite. In particular, he proved the following two theorems:


Theorem 4.18 (Dot Product Kernels in Finite Dimensions) A kernel
$k(\langle x, x' \rangle)$ defined on $S_{N-1} \times S_{N-1}$ is positive definite if and only if its expansion into
Legendre polynomials $P_n^N$ has only nonnegative coefficients, i.e.

$$k(\xi) = \sum_{n=0}^{\infty} b_n P_n^N(\xi) \text{ with } b_n \geq 0. \tag{4.68}$$


Theorem 4.19 (Dot Product Kernels in Infinite Dimensions) A kernel $k(\langle x, x' \rangle)$
defined on the unit sphere in a Hilbert space is positive definite if and only if its Taylor
series expansion has only nonnegative coefficients;

$$k(\xi) = \sum_{n=0}^{\infty} a_n \xi^n \text{ with } a_n \geq 0. \tag{4.69}$$

Therefore, all we have to do in order to check whether a particular kernel may
satisfy Mercer’s condition, is to look at its polynomial series expansion, and check
the coefficients.

We note that (4.69) is a more stringent condition than (4.68). In other words, in
order to prove positive definiteness for arbitrary dimensions it suffices to show
that the Taylor expansion contains only nonnegative coefficients. On the other hand,
in order to prove that a candidate for a kernel function will never be positive
definite, it is sufficient to show this for (4.68) where $P_n^N = P_n$, i.e. for the Legendre
Polynomials.


8. Typically, computer algebra programs can be used to find such expansions for given
kernels k. This greatly reduces the problems in the analysis of such kernels.
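
To illustrate how the Taylor criterion (4.69) is applied in practice, the sketch below lists the leading Taylor coefficients of $k(\xi) = (1 + \xi)^p$, which are the generalized binomial coefficients, and tests them for negativity. The function names are our own, and inspecting finitely many coefficients can only refute the condition, never prove it.

```python
def binomial_series_coefficients(p, n_max):
    """Taylor coefficients a_n of (1 + xi)^p around xi = 0: a_n = p (p-1) ... (p-n+1) / n!."""
    coeffs = [1.0]
    for n in range(1, n_max + 1):
        coeffs.append(coeffs[-1] * (p - (n - 1)) / n)
    return coeffs

def passes_taylor_test(coeffs, tol=1e-12):
    """Condition (4.69) restricted to the listed coefficients: none may be negative."""
    return all(a >= -tol for a in coeffs)
```

For p = 3 the coefficients are 1, 3, 3, 1 followed by zeros, so the test passes; for p = 2.5 the coefficient of ξ⁴ is negative, so the kernel fails condition (4.69).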


Eigenvector Decomposition We conclude this section with an explicit representation
of the eigensystem of $k(\langle x, x' \rangle)$. For a proof see [511].


Lemma 4.20 (Eigenvector Decomposition of Dot Product Kernels) Denote by
$k(\langle x, x' \rangle)$ a kernel on $S_{N-1} \times S_{N-1}$ satisfying condition (4.68) of Theorem 4.18. Then the
eigenvectors of k are given by

$$\psi_{n,j} = Y_{n,j}^N \text{ with eigenvalues } \lambda_{n,j} = b_n \frac{|S_{N-1}|}{M(N, n)} \text{ of multiplicity } M(N, n). \tag{4.70}$$

In other words, $\frac{b_n}{M(N, n)}$ determines the regularization properties of $k(\langle x, x' \rangle)$.


4.6.2 Examples and Applications 


In the following we will analyze a few kernels, and state under which conditions 
they may be used as SV kernels. 


Example 4.21 (Homogeneous Polynomial Kernels $k(x, x') = \langle x, x' \rangle^p$) As we showed
in Chapter 2, this kernel is positive definite for $p \in \mathbb{N}$. We will now show that for $p \notin \mathbb{N}$ this
is never the case.

We thus have to show that (4.68) cannot hold for an expansion in terms of Legendre
Polynomials (N = 3). From [209, 7.126.1], we obtain for $k(\xi) = |\xi|^p$ (we need $|\xi|$ to make
k well-defined),

$$\int_{-1}^{1} P_n(\xi) |\xi|^p \, d\xi = \frac{\sqrt{\pi} \, \Gamma(p + 1)}{2^p \, \Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right) \Gamma\left(\frac{3}{2} + \frac{p}{2} + \frac{n}{2}\right)} \quad \text{if n even.} \tag{4.71}$$

For odd n, the integral vanishes, since $P_n(-\xi) = (-1)^n P_n(\xi)$. In order to satisfy (4.68),
the integral has to be nonnegative for all n. One can see that $\Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right)$ is the only
term in (4.71) that may change its sign. Since the sign of the Γ function alternates with
period 1 for x < 0 (and has poles for negative integer arguments), we cannot find any p
for which $n = 2\lfloor \frac{p}{2} + 1 \rfloor$ and $n = 2\lceil \frac{p}{2} + 1 \rceil$ correspond to positive values of the integral.
Example 4.22 (Inhomogeneous Polynomial Kernels $k(x, x') = (\langle x, x' \rangle + 1)^p$) Likewise,
let us analyze $k(\xi) = (1 + \xi)^p$ for p > 0. Again, we expand k in a series of Legendre
Polynomials, to obtain [209, 7.127]

$$\int_{-1}^{1} P_n(\xi) (\xi + 1)^p \, d\xi = \frac{2^{p+1} \, \Gamma^2(p + 1)}{\Gamma(p + 2 + n) \, \Gamma(p + 1 - n)}. \tag{4.72}$$

For $p \in \mathbb{N}$ all terms with n > p vanish, and the remainder is positive. For non-integer p,
however, (4.72) may change its sign. This is due to $\Gamma(p + 1 - n)$. In particular, for any
$p \notin \mathbb{N}$ (with p > 0), we have $\Gamma(p + 1 - n) < 0$ for $n = \lceil p \rceil + 1$. This violates condition
(4.68), hence such kernels cannot be used in SV machines unless $p \in \mathbb{N}$.
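
The sign change in (4.72) can be located numerically. The sketch below evaluates the right-hand side of (4.72) with the standard library Gamma function and searches for the first order n at which it becomes negative; the function names and the search limit are illustrative assumptions. At the poles of Γ(p + 1 - n) the reciprocal Gamma function vanishes, so the sketch returns zero there.

```python
import math

def legendre_moment(p, n):
    """Right-hand side of (4.72); returns 0.0 where Gamma(p + 1 - n) has a pole."""
    arg = p + 1 - n
    if arg <= 0 and float(arg).is_integer():
        return 0.0  # 1 / Gamma vanishes at the poles, so the coefficient is zero
    return 2.0 ** (p + 1) * math.gamma(p + 1) ** 2 / (math.gamma(p + 2 + n) * math.gamma(arg))

def first_negative_order(p, n_max=20):
    """Smallest n <= n_max for which (4.72) is negative, or None if there is none."""
    for n in range(n_max + 1):
        if legendre_moment(p, n) < 0:
            return n
    return None
```

For integer p no negative value occurs, whereas for p = 2.5 the first negative value appears at n = 4 and for p = 0.5 at n = 2, in agreement with $n = \lceil p \rceil + 1$.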


Example 4.23 (Vovk’s Real Polynomial $k(x, y) = \frac{1 - \langle x, y \rangle^p}{1 - \langle x, y \rangle}$ with $p \in \mathbb{N}$ [459]) This
kernel can be written as $k(\xi) = \sum_{n=0}^{p-1} \xi^n$, hence all the coefficients $a_i = 1$, which means that
the kernel can be used regardless of the dimensionality of the input space.


Likewise we can analyze an infinite power series.


Example 4.24 (Vovk’s Infinite Polynomial $k(x, x') = (1 - \langle x, x' \rangle)^{-1}$ [459]) This
kernel can be written as $k(\xi) = \sum_{n=0}^{\infty} \xi^n$, hence all the coefficients $a_i = 1$. The flat spectrum
of the kernel suggests poor generalization properties.


Example 4.25 (Neural Network Kernels $k(x, x') = \tanh(a + \langle x, x' \rangle)$) We next show
that $k(\xi) = \tanh(a + \xi)$ is never positive definite, no matter how we choose the parameters.

The technique is identical to that of Examples 4.21 and 4.22: we have to show that the
kernel does not satisfy the conditions of Theorem 4.18. Since this is very technical (and is
best done using computer algebra programs such as Maple), we refer the reader to [401] for
details, and explain how the method works in the simpler case of Theorem 4.19. Expanding
$\tanh(a + \xi)$ into a Taylor series yields

$$\tanh a + \xi \frac{1}{\cosh^2 a} - \xi^2 \frac{\tanh a}{\cosh^2 a} - \frac{\xi^3}{3} (1 - \tanh^2 a)(1 - 3 \tanh^2 a) + O(\xi^4). \tag{4.73}$$

We now analyze (4.73) coefficient-wise. Since the coefficients have to be nonnegative,
we obtain $a \in [0, \infty)$ from the first term, $a \in (-\infty, 0]$ from the third term, and
$|a| \in [\operatorname{arctanh} \frac{1}{\sqrt{3}}, \operatorname{arctanh} 1]$ from the fourth term (which requires $1 - 3\tanh^2 a \leq 0$).
This leaves us with $a \in \emptyset$, hence there are no
parameters for which this kernel is positive definite.
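
The coefficient-wise argument can be replayed numerically. The sketch below computes the four leading Taylor coefficients of (4.73) and tests whether all of them are nonnegative for a given offset a; the names are illustrative, and the check should be restricted to moderate |a|, since for very large |a| the value of tanh a rounds to ±1 in floating point arithmetic and the coefficients degenerate.

```python
import math

def tanh_taylor_coefficients(a):
    """Coefficients of xi^0 .. xi^3 in the expansion (4.73) of tanh(a + xi)."""
    t = math.tanh(a)
    sech2 = 1.0 - t * t                       # equals 1 / cosh^2(a)
    return [t, sech2, -t * sech2, -sech2 * (1.0 - 3.0 * t * t) / 3.0]

def admissible(a):
    """True if the first four Taylor coefficients are all nonnegative."""
    return all(c >= 0.0 for c in tanh_taylor_coefficients(a))
```

Scanning a over a grid on [-5, 5] yields no admissible value: negative a fails at the constant term, positive a at the quadratic term, and a = 0 at the cubic term.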


4.7 Multi-Output Regularization 


So far in this chapter we only considered scalar functions f : X → Y. Below we
will show that under rather mild assumptions on the symmetry properties of Y,
there exist no other vector valued extensions to $\Upsilon^*\Upsilon$ than the trivial extension, i.e.,
the application of a scalar regularization operator to each of the dimensions of Y
separately. The reader not familiar with group theory may want to skip the more
detailed discussion given below.

The type of regularization we study is given by quadratic functionals Q[f]. Ridge
regression, RKHS regularizers and also Gaussian Processes are examples of such
regularization. Our proofs rely on a result from [509] which is stated without proof.


Proposition 4.26 (Homogeneous Invariant Regularization [509]) Any regularization
term Q[f] that is both homogeneous quadratic, and invariant under an irreducible
orthogonal representation $\rho$ of the group⁹ G on Y; i.e., that satisfies

\begin{align}
Q[f] &\geq 0 \text{ for all } f \in \mathcal{F}, \tag{4.74} \\
Q[af] &= |a|^2 Q[f] \text{ for all scalars } a, \tag{4.75} \\
Q[\rho(g) f] &= Q[f] \text{ for all } g \in G, \tag{4.76}
\end{align}

is of the form

$$Q[f] = \langle \Upsilon f, \Upsilon f \rangle, \text{ where } \Upsilon \text{ is a scalar operator.} \tag{4.77}$$


The motivation for the requirements (4.74) to (4.76) can be seen as follows: the 
necessity that a regularization term be positive (4.74) is self evident — it must 
at least be bounded from below. Otherwise we could obtain arbitrarily “good” 
estimates by exploiting the pathological behavior of the regularization operator. 
Hence, via a positive offset, Q[f] can be transformed such that it satisfies the 
positivity condition (4.74). 

Homogeneity (4.75) is a useful condition for efficient capacity control — it 
allows easy capacity control by noting that the entropy numbers (a quantity to 
be introduced in Chapter 12), which are a measure of the size of the set of possible 
solutions, scale in a linear (hence, homogeneous) fashion when the hypothesis 
class is rescaled by a constant. Practically speaking, this means that we do not 
need new capacity bounds for every scale the function f might assume. The 
requirement of being quadratic is merely algorithmic, as it allows to avoid taking 
absolute values in the linear or cubic case to ensure positivity, or when dealing 
with derivatives. 

Finally, the invariance must be chosen beforehand. If it happens to be sufficiently 
strong, it can rule out all operators but scalars. Permutation symmetry is such a 
case; in classification, for instance, this would mean that all class labels are treated 
equally. 

A consequence of the proposition is that there exists no vector valued regu- 
larization operator satisfying the invariance conditions. We now look at practical 
applications of Proposition 4.26, which will be stated in the form of corollaries. 


Corollary 4.27 (Permutation and Rotation Symmetries) Under the assumptions of
Proposition 4.26, both the canonical representation of the permutation group (by permutation
matrices) in a finite dimensional vector space Y, and the group of orthogonal
transformations on Y, enforce scalar operators $\Upsilon$.


9. G also may be directly defined on Y, i.e. it might be a matrix group like SU(N).


This follows immediately from the fact that both rotations and permutations (or
more precisely their representations on Y), are unitary and irreducible on Y by
construction. For instance if the permutation group was reducible on Y, then there
would exist subspaces on Y which do not change under any permutation on Y.
This is impossible, however, since we are considering the group of all possible
permutations over Y. Finally, permutations are a subgroup of the group of all
possible orthogonal transformations.

Let us now address the more practical side of such operators, namely how they
translate into function expansions. We need only evaluate $\langle \Upsilon a f, \Upsilon a' f' \rangle$, where f, f′
are scalar functions and a, a′ ∈ Y. Since $\Upsilon$ is also scalar, this yields $\langle a, a' \rangle \langle \Upsilon f, \Upsilon f' \rangle$.
It then remains to evaluate Q[f] for a kernel expansion of f. We obtain:


Corollary 4.28 (Kernel Expansions) Under the assumptions of Proposition 4.26, the
regularization functional Q[f] for a kernel expansion

$$f(x) = \sum_i \alpha_i k(x_i, x), \text{ with } \alpha_i \in Y, \tag{4.78}$$

where $k(x_i, x)$ is a function mapping X × X to the space of scalars S, compatible with the
dot product space Y (we require that $\beta\alpha \in Y$ for $\alpha \in Y$ and $\beta \in S$) can be stated as

$$Q[f] = \sum_{i,j} \langle \alpha_i, \alpha_j \rangle \langle \Upsilon k(x_i, \cdot), \Upsilon k(x_j, \cdot) \rangle. \tag{4.79}$$

In particular, if k is the Green’s function of $\Upsilon^*\Upsilon$, we get

$$Q[f] = \sum_{i,j} \langle \alpha_i, \alpha_j \rangle k(x_i, x_j). \tag{4.80}$$


For possible applications such as regularized principal manifolds, see Chapter 17.
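
A small numerical sketch makes the "trivial extension" concrete: with vector valued coefficients, the functional (4.80) equals the sum of the scalar regularizers applied to each output dimension separately. The Gaussian kernel, the sample points, and the coefficient vectors below are hypothetical choices made purely for illustration.

```python
import math

def gaussian_kernel(x, x_prime, sigma=1.0):
    return math.exp(-(x - x_prime) ** 2 / (2.0 * sigma ** 2))

def vector_regularizer(alphas, points, kernel):
    """Q[f] = sum_{i,j} <alpha_i, alpha_j> k(x_i, x_j), cf. (4.80)."""
    total = 0.0
    for a_i, x_i in zip(alphas, points):
        for a_j, x_j in zip(alphas, points):
            dot = sum(u * v for u, v in zip(a_i, a_j))
            total += dot * kernel(x_i, x_j)
    return total

def per_dimension_regularizer(alphas, points, kernel):
    """Sum over output dimensions of the scalar regularizer sum_{i,j} a_i a_j k(x_i, x_j)."""
    total = 0.0
    for d in range(len(alphas[0])):
        scalar_coeffs = [(a[d],) for a in alphas]
        total += vector_regularizer(scalar_coeffs, points, kernel)
    return total

points = [0.0, 0.4, 1.5]                              # hypothetical inputs
alphas = [(1.0, -2.0), (0.5, 0.3), (-1.2, 0.7)]       # hypothetical coefficients in R^2
```

Both functions return the same value for any coefficients, since the dot product $\langle \alpha_i, \alpha_j \rangle$ is itself a sum over output dimensions.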
In some cases, we may have additional knowledge about the solution we are going
to encounter. In particular, we may know that a specific parametric component
is very likely going to be part of the solution. It would be unwise not to take
advantage of this extra knowledge. For instance, it might be the case that the major
properties of the data are described by a combination of a small set of linearly
independent basis functions $\{\phi_1(\cdot), \ldots, \phi_n(\cdot)\}$. Or we might want to correct the
data for some (e.g. linear) trends. Second, it may also be the case that the user
wants to have an understandable model, without sacrificing accuracy. Many people 
in life sciences tend to have a preference for linear models. These reasons motivate 
the construction of semiparametric models, which are both easy to understand (due 
to the parametric part) and perform well (often thanks to the nonparametric term). 
For more advantages and advocacy on semiparametric models, see [47]. 

A common approach is to fit the data with the parametric model and train the
nonparametric add-on using the errors of the parametric part; that is, we fit the
nonparametric part to the errors. We will show that this is useful only in a very
restricted situation. In general, this method does not permit us to find the best
model amongst a given class for different loss functions. It is better instead to
solve a convex optimization problem, as in standard SVMs, but with a different
set of admissible functions;

$$f(x) = g(x) + \sum_{i=1}^{n} \beta_i \phi_i(x). \tag{4.81}$$

Here g ∈ H, where H is a Reproducing Kernel Hilbert Space as used in Theorem 4.3.
In particular, this theorem implies that there exists a mixed expansion in
terms of kernel functions $k(x_i, x)$ and the parametric part $\phi_j$.

Keeping the standard regularizer $Q[f] = \frac{1}{2}\|f\|_H^2$, we can see that there exist
functions $\phi_1(\cdot), \ldots, \phi_n(\cdot)$ whose contribution is not regularized at all. This need not
be a major concern if n is sufficiently smaller than m, as the VC dimension (and
thus the capacity) of this additional class of linear models is n, hence the overall 
capacity control will still work, provided the nonparametric part is sufficiently 
restricted. 

We will show, in the case of SV regression, how the semiparametric setting trans- 
lates into optimization problems. The application to classification is straightfor- 
ward, and is left as an exercise (see Problem 4.8). 

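Formulating the optimization equations for the expansion (4.81), using the ε-insensitive
loss function, and introducing kernels, we arrive at the following primal
optimization problem:

\begin{align}
\text{minimize} \quad & \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{m} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & \langle w, \Phi(x_i) \rangle + \sum_{j=1}^{n} \beta_j \phi_j(x_i) - y_i \leq \epsilon + \xi_i, \\
& y_i - \langle w, \Phi(x_i) \rangle - \sum_{j=1}^{n} \beta_j \phi_j(x_i) \leq \epsilon + \xi_i^*, \tag{4.82} \\
& \xi_i, \xi_i^* \geq 0.
\end{align}

Computing the Lagrangian (we introduce $\alpha_i, \alpha_i^*, \eta_i, \eta_i^*$ for the constraints) and
solving for the Wolfe dual, yields¹⁰

\begin{align}
\text{maximize} \quad & -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k(x_i, x_j) - \epsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*), \\
\text{subject to} \quad & \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) \phi_j(x_i) = 0 \text{ for all } 1 \leq j \leq n, \tag{4.83} \\
& \alpha_i, \alpha_i^* \in [0, 1/\lambda].
\end{align}


10. See also (1.26) for details how to formulate the Lagrangian.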
Figure 4.10 Backfitting of a model with two parameters, $f(x) = wx + \beta$. Data was generated
by taking 10 samples from the uniform distribution on $[\frac{1}{2}, \frac{3}{2}]$. The target values were
obtained by the dependency $y_i = x_i$. From left to right: (left) best fit with the parametric
model of a constant function; (middle) after adaptation of the second parameter while
keeping the first parameter fixed; (right) optimal fit with both parameters.


Note the similarity to the standard SV regression model. The objective function,
and the box constraints on the Lagrange multipliers $\alpha_i, \alpha_i^*$, remain unchanged.
The only modification comes from the additional un-regularized basis functions.
Instead of a single (constant) function $\phi_1(x) = 1$ as in the standard SV case, we
now have an expansion in the basis $\phi_i(\cdot)$. This gives rise to n constraints instead
of one. Finally, f can be found as

$$f(x) = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) k(x_i, x) + \sum_{i=1}^{n} \beta_i \phi_i(x) \quad \text{since } w = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) \Phi(x_i). \tag{4.84}$$

The only difficulty remaining is how to determine $\beta_i$. This can be done by exploiting
the Karush-Kuhn-Tucker optimality conditions in an analogous manner
to (1.30), or more easily, by using an interior point algorithm (Section 6.4). In the
latter case, the variables $\beta_i$ can be obtained as the dual variables of the dual (dual
dual = primal) optimization problem (4.83), as a by-product of the optimization
process.

It might seem that the approach presented above is quite unnecessary, and 
overly complicated for semiparametric modelling. In fact, we could try to fit the 
data to the parametric model first, and then fit the nonparametric part to the 
residuals; this approach is called backfitting. In most cases, however, this does not 
lead to the minimum of the regularized risk functional. We will show this using a 
simple example. 

Consider a SV regression machine as defined in Section 1.6, with linear kernel
(i.e. $k(x, x') = \langle x, x' \rangle$) in one dimension, and a constant term as parametric part
(i.e. $f(x) = wx + \beta$). Now suppose the data was generated by $y_i = x_i$, where $x_i$
is uniformly drawn from $[\frac{1}{2}, \frac{3}{2}]$ without noise. Clearly, $y_i \geq \frac{1}{2}$ also holds for all i.
By construction, the best overall fit of the pair $(\beta, w)$ will be arbitrarily close to
(0, 1) if the regularization parameter λ is chosen sufficiently small. For backfitting,
we first carry out the parametric fit, to find a constant β minimizing the term
$\sum_{i=1}^{m} c(y_i - \beta)$. Depending on the chosen loss function c(·), β will be the mean
($L_2$-error), the median ($L_1$-error), a trimmed mean (related to the ε-insensitive loss), or
some other function of the set $\{y_1 - wx_1, \ldots, y_m - wx_m\}$ (cf. Section 3.4). Since all
$y_i \geq \frac{1}{2}$, we have $\beta \geq \frac{1}{2}$; this is not the optimal solution of the overall problem, since
in the latter case β would be close to 0, as seen above.

Hence backfitting does not minimize the regularized risk functional, even in the
simplest of settings; and we certainly cannot expect backfitting to work in more
complex cases. There exists only one case in which backfitting suffices, namely
if the function spaces spanned by the kernel expansion $\{k(x_i, \cdot)\}$ and $\{\phi_i(\cdot)\}$ are
orthogonal. Consequently we must in general jointly solve for both the parametric
and the nonparametric part, as done in (4.82) and (4.83).
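
The failure of backfitting can be reproduced in a few lines. The sketch below deliberately simplifies the setting: it uses the squared loss instead of the ε-insensitive loss, so that both fits have closed forms, and it replaces the random sample by ten evenly spaced points on [1/2, 3/2]. All names and the value of λ are illustrative assumptions.

```python
def regularized_risk(w, beta, xs, ys, lam):
    """(lam / 2) w^2 + sum_i (y_i - w x_i - beta)^2 for the model f(x) = w x + beta."""
    return 0.5 * lam * w * w + sum((y - w * x - beta) ** 2 for x, y in zip(xs, ys))

def backfit(xs, ys, lam):
    """Fit the constant first (mean of y), then fit w to the residuals with beta fixed."""
    beta = sum(ys) / len(ys)
    w = sum((y - beta) * x for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + 0.5 * lam)
    return w, beta

def joint_fit(xs, ys, lam):
    """Minimize the regularized risk over w and beta simultaneously (closed form)."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / (s_xx + 0.5 * lam)
    return w, y_bar - w * x_bar

xs = [0.5 + i / 9.0 for i in range(10)]   # evenly spaced stand-in for 10 samples from [1/2, 3/2]
ys = list(xs)                             # noise-free targets y_i = x_i
lam = 1e-3
```

With these settings the joint fit returns (w, β) close to (1, 0), while backfitting keeps β at the mean of the targets and attains a markedly larger regularized risk.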

Above, we effectively excluded a set of basis functions $\phi_1, \ldots, \phi_n$ from being
regularized at all. This means that we could use regularization functionals Q[f]
that need not be positive definite on the whole Reproducing Kernel Hilbert Space
H but only on the orthogonal complement to span $\{\phi_1, \ldots, \phi_n\}$.

This brings us back to the notion of conditional positive definite kernels, as
explained in Section 2.2. These exclude the space of linear functions from the space
of admissible functions f, in order to achieve a positive definite regularization
term Q[f] on the orthogonal complement.

In (4.83), this is precisely what happens with the functions $\phi_i$, which are not
supposed to be regularized. Consequently, if we choose $\phi_i$ to be the family of all
linear functions, the semiparametric approach will allow us to use conditionally
positive definite (cpd) kernels (see Definition 2.21 and below) without any further
problems.
Most of the discussion in the current chapter was based on regularization in
Reproducing Kernel Hilbert Spaces, and explicitly avoided any specific restrictions
on the type of coefficient expansions used. This is useful insofar as it provides a 
powerful mathematical framework to assess the quality of the estimates obtained 
in this process. 

In some cases, however, we would rather use a regularization operator that acts 
directly on coefficient space, be it for theoretical reasons (see Section 16.5), or to 
satisfy the practical desire to obtain sparse expansions (Section 4.9.2); or simply by 
the heuristic that small coefficients generally translate into simple functions. 

We will now consider the situation where $\Omega[f]$ can be written as a function of the
coefficients $\alpha_i$, where $f$ will again be expanded as a linear combination of kernel
functions,

$$f(x) = \sum_{i=1}^{n} \alpha_i k(x_i', x) \quad \text{and} \quad \Omega[f] = \Omega[\alpha], \tag{4.85}$$

but with the possibility that the $x_i'$ and the training patterns $x_i$ do not coincide, and
that possibly $m \neq n$.

4.9.1 Ridge Regression


A popular choice to regularize linear combinations of basis functions is by a 
weight decay term (see [339, 49] and the references therein), which penalizes large 
weights. Thus we choose 


$$\Omega[f] = \frac{1}{2} \sum_{i=1}^{n} \alpha_i^2 = \frac{1}{2} \|\alpha\|^2. \tag{4.86}$$


This is also called Ridge Regression [245, 377], and is a very common method in 
the context of shrinkage estimators. 
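
To make the coefficient-space penalty (4.86) concrete, here is a minimal sketch, assuming a squared loss and a Gaussian kernel (both are choices made for this illustration, not prescribed by the text). The kernel centers are deliberately different from the training patterns, as permitted by (4.85); the helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Matrix K[i, j] = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def ridge_coefficients(K, y, lam):
    """Minimize 0.5*||y - K alpha||^2 + 0.5*lam*||alpha||^2 over alpha."""
    n = K.shape[1]
    # Setting the gradient to zero gives (K^T K + lam I) alpha = K^T y.
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ y)

x_train = np.linspace(-3.0, 3.0, 40)      # m = 40 training patterns
centers = np.linspace(-3.0, 3.0, 10)      # n = 10 kernel centers x'_i
y_train = np.sin(x_train)
lam = 1e-2

K = gaussian_kernel(x_train, centers)     # shape (m, n)
alpha = ridge_coefficients(K, y_train, lam)
f_train = K @ alpha                       # fitted values on the training set
```

Note that every coefficient is generally nonzero in such a solution, which is the motivation for the $\ell_1$ penalty discussed in Section 4.9.2.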

Similar to Section 4.3, we now investigate whether there exists a correspondence 
between Ridge Regression and SVMs. Although no strict equivalence holds, we 
will show that it is possible to obtain models generated by the same type of 
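regularization operator. The requirement on an operator $\Upsilon$ for a strict equivalence
would be

$$\sum_{i,j=1}^{n} \alpha_i \alpha_j \langle (\Upsilon k)(x_i, \cdot), (\Upsilon k)(x_j, \cdot) \rangle = \sum_{i=1}^{n} \alpha_i^2, \tag{4.87}$$

and thus,

$$\langle (\Upsilon k)(x_i, \cdot), (\Upsilon k)(x_j, \cdot) \rangle = \delta_{ij}. \tag{4.88}$$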


Unfortunately this requirement is not suitable for the case of the Kronecker $\delta$, as
(4.88) implies the functions $(\Upsilon k)(x_i, \cdot)$ to be elements of a non-separable Hilbert
space. The solution is to change the finite Kronecker $\delta$ into the more appropriate
$\delta$-distribution, i.e. $\delta(x_i - x_j)$.

By reasoning similar to Theorem 4.9, we can see that (4.88) holds, with $k(x, x')$ the
Green’s function of $\Upsilon$. Note that as a regularization operator, $(\Upsilon^* \Upsilon)^{1/2}$ is equivalent
to $\Upsilon$, as we can always replace the latter by the former without any difference in
the regularization properties. Therefore, we assume without loss of generality that
$\Upsilon$ is a positive definite operator. Formally, we require

$$\langle (\Upsilon k)(x_i, \cdot), (\Upsilon k)(x_j, \cdot) \rangle = \langle \delta_{x_i}(\cdot), \delta_{x_j}(\cdot) \rangle = \delta_{x_i, x_j}. \tag{4.89}$$


Again, this allows us to connect regularization operators and kernels: the Green’s
function of $\Upsilon$ must be found in order to satisfy (4.89). For the special case of
translation invariant operators represented in Fourier space, we can associate $\Upsilon$
with $\Upsilon_{\mathrm{ridge}}(\omega)$ as with (4.28), leading to

$$\|\Upsilon f\|^2 = \int \left| \frac{F(\omega)}{\Upsilon_{\mathrm{ridge}}(\omega)} \right|^2 d\omega. \tag{4.90}$$

This expansion is possible since the Fourier transform diagonalizes the
corresponding regularization operator: repeated applications of $\Upsilon$ become
multiplications in the Fourier domain. Comparing (4.90) with (4.28) leads to the
conclusion that the following relation between kernels for Support Vector Machines
and Ridge Regression holds,

$$\Upsilon_{\mathrm{SV}}(\omega) = |\Upsilon_{\mathrm{ridge}}(\omega)|^2. \tag{4.91}$$


In other words, in Ridge Regression it is the squared Fourier transform of the 
kernels that determines the regularization properties. Later on in Chapter 16, 
Theorem 16.9 will give a similar result, derived under the assumption that the 
penalties on $\alpha_i$ are given by a prior probability over the distribution of expansion
coefficients. 

This connection also explains the performance of Ridge Regression Models 
in a smoothing regularizer context (the squared norm of the Fourier transform 
of the kernel function describes its regularization properties), and allows us to 
“transform” Support Vector Machines to Ridge Regression models and vice versa. 
Note, however, that the sparsity properties of Support Vectors are lost. 
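
As a worked illustration of the relation (4.91), consider a Gaussian kernel. This is a sketch only: it reads $\Upsilon_{\mathrm{ridge}}(\omega)$ as the Fourier transform of the Ridge Regression kernel, as the paragraph above suggests, and it ignores all normalization constants.

$$k_{\mathrm{ridge}}(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \sigma^2} \right) \quad \Longrightarrow \quad \Upsilon_{\mathrm{ridge}}(\omega) \propto \exp\left( -\frac{\sigma^2 \|\omega\|^2}{2} \right),$$

$$\Upsilon_{\mathrm{SV}}(\omega) = |\Upsilon_{\mathrm{ridge}}(\omega)|^2 \propto \exp\left( -\sigma^2 \|\omega\|^2 \right) \quad \Longrightarrow \quad k_{\mathrm{SV}}(x, x') \propto \exp\left( -\frac{\|x - x'\|^2}{4 \sigma^2} \right).$$

Under these assumptions, Ridge Regression with a Gaussian kernel of width $\sigma$ has the same type of regularization as a Support Vector Machine with a Gaussian kernel of width $\sqrt{2} \sigma$.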


4.9.2 Linear Programming Regularization ($\ell_1$)


A squared penalty on the coefficients $\alpha_i$ has the disadvantage that even though
some kernel functions $k(x_i, x)$ may not contribute much to the overall solution,
they still appear in the function expansion. This is due to the fact that the gradient
of $\alpha_i^2$ tends to 0 for $\alpha_i \to 0$ (this can easily be checked by looking at the partial
derivative of $\Omega[f]$ with respect to $\alpha_i$). On the other hand, a regularizer whose
derivative does not vanish in the neighborhood of 0 will not exhibit such problems.
This is why we choose

$$\Omega[f] = \sum_i |\alpha_i|. \tag{4.92}$$

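The regularized risk minimization problem can then be rewritten as

$$\begin{array}{ll}
\text{minimize} & R_{\mathrm{reg}}[f] = \lambda \sum_{i=1}^{m} |\alpha_i| + \sum_{i=1}^{m} (\xi_i + \xi_i^*), \\
\text{subject to} & y_i - \sum_{j=1}^{m} \alpha_j k(x_j, x_i) - \sum_{j=1}^{n} \beta_j \phi_j(x_i) - b \leq \varepsilon + \xi_i, \\
& \sum_{j=1}^{m} \alpha_j k(x_j, x_i) + \sum_{j=1}^{n} \beta_j \phi_j(x_i) + b - y_i \leq \varepsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \geq 0.
\end{array} \tag{4.93}$$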


Besides replacing $\alpha_i$ with $\alpha_i - \alpha_i^*$, $|\alpha_i|$ with $\alpha_i + \alpha_i^*$, and requiring $\alpha_i, \alpha_i^* \geq 0$, there
is hardly anything that can be done to render the problem more computationally
feasible: the constraints are already linear. Moreover, most optimization software
can deal efficiently with problems of this kind.
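
The substitution just described turns the problem into a standard linear program. The sketch below shows one way to set it up with SciPy's `linprog`, assuming a Gaussian kernel, kernel centers equal to the training points, and no parametric terms other than the offset $b$; the function name `lp_regression` and all parameter values are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def lp_regression(K, y, lam, eps):
    """min lam*sum(alpha+ + alpha-) + sum(xi + xi*) under the eps-tube constraints."""
    m, n = K.shape
    # Variable order: alpha+ (n), alpha- (n), xi (m), xi* (m), b (1).
    c = np.concatenate([lam * np.ones(2 * n), np.ones(2 * m), [0.0]])
    I, Z, one = np.eye(m), np.zeros((m, m)), np.ones((m, 1))
    # y - K(alpha+ - alpha-) - b <= eps + xi
    upper = np.hstack([-K, K, -I, Z, -one])
    # K(alpha+ - alpha-) + b - y <= eps + xi*
    lower = np.hstack([K, -K, Z, -I, one])
    A_ub = np.vstack([upper, lower])
    b_ub = np.concatenate([eps - y, eps + y])
    # All variables are nonnegative except the offset b, which is free.
    bounds = [(0, None)] * (2 * n + 2 * m) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    alpha = res.x[:n] - res.x[n:2 * n]
    return alpha, res.x[-1], res

x = np.linspace(-3.0, 3.0, 30)
y = np.sin(x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)   # Gaussian kernel matrix
lam, eps = 0.1, 0.1
alpha, b, res = lp_regression(K, y, lam, eps)
print("nonzero coefficients:", int(np.sum(np.abs(alpha) > 1e-8)), "of", len(alpha))
```

Since the slack variables can always absorb any violation of the tube, the program is feasible for every choice of `lam` and `eps`.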


4.9.3 Mixed Semiparametric Regularizers

We now investigate the use of mixed regularization functionals, with different
penalties for distinct parts of the function expansion, as suggested by equations
(4.92) and (4.81). Indeed, we can construct the following variant, which is a
mixture of linear and quadratic regularizers,

$$\Omega[f] = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} |\beta_i|. \tag{4.94}$$


The equation above is essentially the SV estimation model, with an additional
linear regularization term added for the parametric part. In this case, the constraints
on the optimization problem (4.83) become

$$\begin{array}{l}
-1 \leq \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) \phi_j(x_i) \leq 1 \quad \text{for all } 1 \leq j \leq n, \\
\alpha_i, \alpha_i^* \in [0, 1/\lambda],
\end{array} \tag{4.95}$$

and the variables $\beta_j$ are obtained as the dual variables of the constraints, as
discussed previously in similar cases. Finally, we could reverse the setting to obtain a
regularizer,

$$\Omega[f] = \sum_{i=1}^{m} |\alpha_i - \alpha_i^*| + \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j M_{ij}, \tag{4.96}$$

for some positive definite matrix $M$. Note that (4.96) can be reduced to the case of
(4.94) by renaming variables accordingly, given a suitable choice of $M$.

The proposed regularizers are a simple extension of existing methods such as 
Basis Pursuit [104], or Linear Programming for classification (e.g. [184]). The com- 
mon idea is to have two different sets of basis functions which are regularized 
differently, or a subset that is not regularized at all. This is an efficient way of en- 
coding prior knowledge or user preference, since the emphasis is on the functions 
with little or no regularization. 

Finally, one could also use a regularization functional $\Omega[f] = \|\alpha\|_0$ which simply
counts the number of nonzero terms in the vector $\alpha \in \mathbb{R}^n$, or alternatively, combine
this regularizer with the $\ell_1$ norm to obtain $\Omega[f] = \|\alpha\|_0 + \|\alpha\|_1$. This is a concave
function in $\alpha$, which, in combination with the soft-margin loss function, leads to
an optimization problem which is, as a whole, concave. Therefore one may apply
Rockafellar’s theorem (Theorem 6.12) to obtain an optimal solution. See [189] for
further details and an explicit algorithm.


A connection between Support Vector kernels and regularization operators has 
been established, which can provide one key to understanding why Support Vec- 
tor Machines have been found to exhibit high generalization ability. In particular, 
for common choices of kernels, the mapping into feature space is not arbitrary, but 
corresponds to useful regularization operators (see Sections 4.4.1, 4.4.2 and 4.4.4). 
For kernels where this is not the case, Support Vector Machines may show poor 
performance (Section 4.4.3). This will become more obvious in Chapter 12, where,
building on the results of the current chapter, the eigenspectrum of integral
operators is connected with generalization bounds of the corresponding Support Vector
Machines.

The link to regularization theory can be seen as a tool for determining the struc- 
ture, consisting of sets of functions, in which Support Vector Machines and other 
kernel algorithms (approximately) perform structural risk minimization [561], 
possibly in a data dependent manner. In other words, it allows us to choose an 
appropriate kernel given the data and the problem specific knowledge. 

A simple consequence of this link is a Bayesian interpretation of Support Vector 
Machines. In this case, the choice of a special kernel can be regarded as a prior on 
the hypothesis space, with $P[f] \propto \exp(-\lambda \|\Upsilon f\|^2)$. See Chapter 16 for more detail
on this matter. 

It should be clear by now that the setting of Tikhonov and Arsenin [538], whilst 
very powerful, is certainly not the only conceivable one. A theorem on vector val- 
ued regularization operators showed, however, that under quite generic condi- 
tions on the isotropy of the space of target values, only scalar operators are possi- 
ble; an extended version of their approach is thus the only possible option. 

Finally a closer consideration of the null space of regularization functionals 
$\Omega[f]$ led us to formulate semiparametric models. The roots of such models lie in
the representer theorem (Theorem 4.2), proposed and explored in the context of 
smoothing splines in [296]. In fact, the SV expansion is a direct consequence of the 
representer theorem. 

Moreover the semiparametric setting solves a problem created by the use of con- 
ditionally positive definite kernels of order q (see Section 2.4.3). Here, polynomials 
of order lower than q are excluded. Hence, to cope with this effect, we must add 
polynomials back in “manually.” The semiparametric approach presents a way of 
doing that. Another application of semiparametric models, besides the conven- 
tional approach of treating the nonparametric part as nuisance parameters [47], is in 
the domain of hypothesis testing, for instance to test whether a parametric model 
fits the data sufficiently well. This can be achieved in the framework of structural 
risk minimization [561] — given the different models (nonparametric vs. semi- 
parametric vs. parametric), we can evaluate the bounds on the expected risk, and 
then choose the model with the best bound. 


4.1 (Equivalent Optimization Strategies •••) Denote by $S$ a metric space and by
$R, \Omega : S \to \mathbb{R}$ two strictly convex continuous maps. Let $\lambda > 0$.

▪ Show that the map $f \mapsto R[f] + \lambda \Omega[f]$ has only one minimum and a unique minimizer.
Hint: assume the contrary and consider a straight line between two minima.

▪ Show that for every $\lambda > 0$, there exists an $\Omega_\lambda$ such that minimization of $R[f] + \lambda \Omega[f]$
is equivalent to minimizing $R[f]$ subject to $\Omega[f] \leq \Omega_\lambda$. Show that an analogous statement
holds with $R$ and $\Omega$ exchanged. Hint: consider the minimizer of $R[f] + \lambda \Omega[f]$, and keep
the second term fixed while minimizing over the first term.

▪ Consider the parametrized curve $(\Omega(\lambda), R(\lambda))$. What is the shape of this curve? Show
that (barring discontinuities) $-\lambda$ is the tangent on the curve.

▪ Consider the parametrized curve $(\ln \Omega(\lambda), \ln R(\lambda))$ as proposed by Hansen [225]. Show
that a tangent criterion similar to that imposed above is scale insensitive with respect to
$\Omega$ and $R$. Why is this useful? What are the numerical problems with such an ansatz?


4.2 (Orthogonality and Span ••) Show that the second condition of Definition 2.9 is
equivalent to requiring

$$\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = 0 \text{ for all } x \in \mathcal{X} \implies f = 0. \tag{4.97}$$


4.3 (Semiparametric Representer Theorem ••) Prove Theorem 4.3. Hint: start with
a decomposition of f into a parametric part, a kernel part, and an orthogonal contribution 
and evaluate the loss and regularization terms independently. 


4.4 (Kernel Boosting •••) Show that for $f \in \mathcal{H}$ and $c(x, y, f(x)) = \exp(-y f(x))$, you
can develop a boosting algorithm by performing a coefficient-wise gradient descent on
the coefficients $\alpha_i$ of the expansion $f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x)$. In particular, show that the
expansion above is optimal.

What changes if we drop the regularization term $\Omega[f] = \|f\|^2$? See [498, 577, 221] for
examples.


4.5 (Monotonicity of the Regularizer ••) Give an example where, due to the fact that
$\Omega[f]$ is not strictly monotonic, the kernel expansion (4.5) is not the only minimizer of the
regularized risk functional (4.4).


4.6 (Sparse Expansions ••) Show that it is a sufficient requirement for the coefficients
$\alpha_i$ of the kernel expansion of the minimizer of (4.4) to vanish, if for the corresponding loss
functions $c(x_i, y_i, f(x_i))$ both the lhs and the rhs derivative with respect to $f(x_i)$ vanish.
Hint: use the proof strategy of Theorem 4.2.

Furthermore show that for loss functions $c(x, y, f(x))$ this implies that we can obtain
vanishing coefficients only if $c(x_i, y_i, f(x_i)) = 0$.


4.7 (Biased Regularization ••) Show that for biased regularization (Remark 4.4) with
$g(\|f\|_{\mathcal{H}}) = \frac{1}{2} \|f\|_{\mathcal{H}}^2$, the effective overall regularizer is given by $\frac{1}{2} \|f - f_0\|^2$.


4.8 (Semiparametric Classification ••) Show that given a set of parametric basis
functions $\phi_j$, the optimization problem for SV classification has the same objective function as
(1.31), however with the constraints [506]

$$0 \leq \alpha_i \leq C \text{ for all } i \in [m] \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i \phi_j(x_i) = 0 \text{ for all } j. \tag{4.98}$$

What happens if you combine semiparametric classification with adaptive margins (the
$\nu$-trick)?

4.9 (Regularization Properties of Kernels •) Analyze the regularization properties of
the Laplacian kernel $k(x, x') = e^{-\|x - x'\|}$. What is the rate of decay in its power spectrum?
What is the kernel corresponding to the operator

$$\|\Upsilon f\|^2 := \|f\|^2 + \|\nabla f\|^2 + \|\Delta f\|^2? \tag{4.99}$$

Hint: rewrite $\Upsilon$ in the Fourier domain.


4.10 (Periodizing the Laplacian Kernel •) Show that for the Laplacian kernel $k(x, x') = e^{-|x - x'|}$,
the periodization with period $a$ results in a kernel proportional to

$$k_p(x, x') = e^{[|x - x'| \bmod a]} + e^{-[|x - x'| \bmod a] + a}. \tag{4.100}$$

4.11 (Hankel Transform and Inversion •••) Show that for radially symmetric
functions, the Fourier transform is given by (4.52). Moreover use (4.51) to prove the Hankel
inversion theorem, stating that $H_\nu$ is its own inverse.


4.12 (Eigenvector Decompositions of Polynomial Kernels •••) Compute the
eigenvalues of polynomial kernels on $U_N$. Hint: use [511] and separate the radial from the
angular part in the eigenvector decomposition of k, and solve the radial part empirically 
via numerical analysis. Possible kernels to consider are Vovk’s kernel, (in)homogeneous 
polynomials and the hyperbolic tangent kernel. 


4.13 (Necessary Conditions for Kernels ••) Burges [86] shows, by using differential
geometric methods, that a necessary condition for a differentiable translation invariant
kernel $k(x, x') = k(\|x - x'\|^2)$ to be positive definite is

$$k(0) > 0 \text{ and } k'(0) < 0. \tag{4.101}$$

Prove this using functional analytic methods.

4.14 (Mixed Semiparametric Regularizers ••) Derive (4.96). Hint: set up the primal
optimization problem as described in Section 1.4, compute the Lagrangian, and eliminate
the primal variables.

Can you find an interpretation of (4.95)? What is the effect of $\sum_i (\alpha_i - \alpha_i^*) \phi_j(x_i)$?


Overview 
We now give a more complete exposition of the ideas of statistical learning theory,
which we briefly touched on in Chapter 1. We mentioned previously that in order 
to learn from a small training set, we should try to explain the data with a model of 
small capacity; we have not yet justified why this is the case, however. This is the 
main goal of the present chapter. 

We start by revisiting the difference between risk minimization and empirical 
risk minimization, and illustrating some common pitfalls in machine learning, 
such as overfitting and training on the test set (Section 5.1). We explain that the 
motivation for empirical risk minimization is the law of large numbers, but that 
the classical version of this law is not sufficient for our purposes (Section 5.2). 
Thus, we need to introduce the statistical notion of consistency (Section 5.3). It turns 
out that consistency of learning algorithms amounts to a law of large numbers, 
which holds uniformly over all functions that the learning machine can implement 
(Section 5.4). This crucial insight, due to Vapnik and Chervonenkis, focuses our 
attention on the set of attainable functions; this set must be restricted in order to 
have any hope of succeeding. Section 5.5 states probabilistic bounds on the risk 
of learning machines, and summarizes different ways of characterizing precisely 
how the set of functions can be restricted. This leads to the notion of capacity 
concepts, which gives us the main ingredients of the typical generalization error 
bound of statistical learning theory. We do not indulge in a complete treatment; 
rather, we try to give the main insights to provide the reader with some intuition 
as to how the different pieces of the puzzle fit together. We end with a section 
showing an example application of risk bounds for model selection (Section 5.6). 

The chapter attempts to present the material in a fairly non-technical manner, 
providing intuition wherever possible. Given the nature of the subject matter, 
however, a limited amount of mathematical background is required. The reader 
who is not familiar with basic probability theory should first read Section B.1. 


Let us start with an example. We consider a regression estimation problem. Sup- 
pose we are given empirical observations, 


$$(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathcal{Y}, \tag{5.1}$$

where for simplicity we take $\mathcal{X} = \mathcal{Y} = \mathbb{R}$. Figure 5.1 shows a plot of such a dataset,
along with two possible functional dependencies that could underlie the data. The 
dashed line represents a fairly complex model, and fits the training data perfectly. 
The straight line, on the other hand, does not completely “explain” the data, in 
the sense that there are some residual errors; it is much “simpler,” however. A 
physicist measuring these data points would argue that it cannot be by chance 
that the measurements almost lie on a straight line, and would much prefer to 
attribute the residuals to measurement error than to an erroneous model. But is it 
possible to characterize the way in which the straight line is simpler, and why this 
should imply that it is, in some sense, closer to an underlying true dependency? 
In one form or another, this issue has long occupied the minds of researchers 
studying the problem of learning. In classical statistics, it has been studied as the 
bias-variance dilemma. If we computed a linear fit for every data set that we ever 
encountered, then every functional dependency we would ever “discover” would 
be linear. But this would not come from the data; it would be a bias imposed by 
us. If, on the other hand, we fitted a polynomial of sufficiently high degree to any 
given data set, we would always be able to fit the data perfectly, but the exact 
model we came up with would be subject to large fluctuations, depending on 


Figure 5.1 Suppose we want to estimate a 
functional dependence from a set of examples 
(black dots). Which model is preferable? The 
complex model perfectly fits all data points, 
whereas the straight line exhibits residual er- 
rors. Statistical learning theory formalizes the 
role of the complexity of the model class, and 
gives probabilistic guarantees for the validity 
of the inferred model. 
how accurate our measurements were in the first place; the model would suffer
from a large variance. A related dichotomy is the one between estimation error and 
approximation error. If we use a small class of functions, then even the best possible 
solution will poorly approximate the “true” dependency, while a large class of 
functions will lead to a large statistical estimation error. 
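
The trade-off can be illustrated with a small numerical experiment. The following sketch assumes a true dependency $y = x$ observed with Gaussian noise, and compares a straight-line fit with a polynomial that has as many coefficients as there are training points; the noise level, sample sizes, and random seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Training set: 8 equally spaced inputs, noisy observations of y = x.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = x_train + noise * rng.normal(size=x_train.size)

# Independent test set drawn from the same (assumed) distribution.
x_test = rng.uniform(-1.0, 1.0, 1000)
y_test = x_test + noise * rng.normal(size=x_test.size)

line = np.polyfit(x_train, y_train, deg=1)      # small function class
wiggly = np.polyfit(x_train, y_train, deg=7)    # can pass through all 8 points

for name, coeffs in [("degree 1", line), ("degree 7", wiggly)]:
    print(name,
          "train MSE:", mse(coeffs, x_train, y_train),
          "test MSE:", mse(coeffs, x_test, y_test))
```

Rerunning the experiment with different seeds shows how strongly the high-degree fit changes from one training sample to the next, while the straight line barely moves.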

In the terminology of applied machine learning and the design of neural net- 
works, the complex explanation shows overfitting, while an overly simple expla- 
nation imposed by the learning machine design would lead to underfitting. A great 
deal of research has gone into clever engineering tricks and heuristics; these are 
used, for instance, to aid in the design of neural networks which will not overfit 
on a given data set [397]. In neural networks, overfitting can be avoided in a num- 
ber of ways, such as by choosing a number of hidden units that is not too large, by 
stopping the training procedure early in order not to enforce a perfect explanation 
of the training set, or by using weight decay to limit the size of the weights, and 
thus of the function class implemented by the network. 

Statistical learning theory provides a solid mathematical framework for study- 
ing these questions in depth. As mentioned in Chapters 1 and 3, it makes the as- 
sumption that the data are generated by sampling from an unknown underlying 
distribution P(x, y). The learning problem then consists in minimizing the risk (or 
expected loss on the test data, see Definition 3.3), 


$$R[f] = \int_{\mathcal{X} \times \mathcal{Y}} c(x, y, f(x)) \, dP(x, y). \tag{5.2}$$


Here, $c$ is a loss function. In the case of pattern recognition, where $\mathcal{Y} = \{\pm 1\}$, a
common choice is the misclassification error, $c(x, y, f(x)) = \frac{1}{2} |f(x) - y|$.

The difficulty of the task stems from the fact that we are trying to minimize a 
quantity that we cannot actually evaluate: since we do not know P, we cannot 
compute the integral (5.2). What we do know, however, are the training data (5.1), 
which are sampled from P. We can thus try to infer a function f from the training 
sample that is, in some sense, close to the one minimizing (5.2). To this end, we 
need what is called an induction principle. 

One way to proceed is to use the training sample to approximate the integral in 
(5.2) by a finite sum (see (B.18)). This leads to the empirical risk (Definition 3.4), 


$$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)), \tag{5.3}$$

and the empirical risk minimization (ERM) induction principle, which recommends 
that we choose an f that minimizes (5.3). 
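
As a concrete illustration of (5.3) and the ERM principle, the sketch below evaluates the empirical risk under the misclassification loss and selects the minimizer from a small finite class of threshold classifiers. The data distribution, the noise rate, and the function class are hypothetical choices made only for this example.

```python
import random

def empirical_risk(f, sample):
    """R_emp[f]: average misclassification loss 0.5*|f(x) - y| over the sample."""
    return sum(0.5 * abs(f(x) - y) for x, y in sample) / len(sample)

def threshold_classifier(t):
    """f_t(x) = +1 if x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1

def erm(function_class, sample):
    """Return a function from the class with the smallest empirical risk."""
    return min(function_class, key=lambda f: empirical_risk(f, sample))

random.seed(0)
# Hypothetical data: x uniform on [0, 1], label +1 iff x >= 0.3, 10% label noise.
sample = []
for _ in range(200):
    x = random.random()
    y = 1 if x >= 0.3 else -1
    if random.random() < 0.1:
        y = -y
    sample.append((x, y))

thresholds = [i / 20 for i in range(21)]
function_class = [threshold_classifier(t) for t in thresholds]
f_best = erm(function_class, sample)
print("smallest empirical risk:", empirical_risk(f_best, sample))
```

Because the class contains only 21 functions, the selected classifier cannot fit the label noise; this restriction is exactly what the remainder of the section argues is necessary.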

Cast in these terms, the fundamental trade-off in learning can be stated as
follows: if we allow $f$ to be taken from a very large class of functions $\mathcal{F}$, we can
always find an $f$ that leads to a rather small value of (5.3). For instance, if we allow
the use of all functions $f$ mapping $\mathcal{X} \to \mathcal{Y}$ (in compact notation, $\mathcal{F} = \mathcal{Y}^{\mathcal{X}}$), then we
can minimize (5.3) yet still be distant from the minimizer of (5.2). Considering a
pattern recognition problem, we could set

$$f(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i = 1, \ldots, m, \\ 1 & \text{otherwise.} \end{cases} \tag{5.4}$$


This does not amount to any form of learning, however: suppose we are now given 
a test point drawn from the same distribution, (x, y) ~ P(x, y). If X is a continuous 
domain, and we are not in a degenerate situation, the new pattern x will almost 
never be exactly equal to any of the training inputs $x_i$. Therefore, the learning
machine will almost always predict that $y = 1$. If we allow all functions from $\mathcal{X}$ to $\mathcal{Y}$,
then the values of the function at points $x_1, \ldots, x_m$ carry no information about the values
at other points. In this situation, a learning machine cannot do better than chance. 
This insight lies at the core of the so-called No-Free-Lunch Theorem popularized in 
[608]; see also [254, 48]. 
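
The behavior of the memorizing function (5.4) is easy to reproduce. The sketch below assumes a hypothetical distribution in which $x$ is uniform on $[-1, 1]$ and $y$ is the sign of $x$; any other continuous distribution would serve equally well.

```python
import random

random.seed(1)

def draw(n):
    """Hypothetical distribution: x uniform on [-1, 1], y = sign(x)."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 1 if x >= 0 else -1) for x in xs]

train, test = draw(100), draw(10000)
memory = dict(train)

def f(x):
    """The function (5.4): recall stored labels, predict +1 everywhere else."""
    return memory.get(x, 1)

def risk(sample):
    """Average misclassification loss of f on the given sample."""
    return sum(0.5 * abs(f(x) - y) for x, y in sample) / len(sample)

train_error = risk(train)   # zero by construction
test_error = risk(test)     # close to P(y = -1), about one half here
print("training error:", train_error, "test error:", test_error)
```

The empirical risk is minimal, yet on new inputs the function simply outputs the constant 1, so its test error is the probability of the other class.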

The message is clear: if we make no restrictions on the class of functions from 
which we choose our estimate f, we cannot hope to learn anything. Consequently, 
machine learning research has studied various ways to implement such restric- 
tions. In statistical learning theory, these restrictions are enforced by taking into 
account the complexity or capacity (measured by VC dimension, covering numbers, 
entropy numbers, or other concepts) of the class of functions that the learning
machine can implement.¹

In the Bayesian approach, a similar effect is achieved by placing prior distribu- 
tions P(f) over the class of functions (Chapter 16). This may sound fundamentally 
different, but it leads to algorithms which are closely related; and on the theoretical 
side, recent progress has highlighted intriguing connections [92, 91, 353, 238]. 


1. As an aside, note that the same problem applies to training on the test set (sometimes
called data snooping): sometimes, people optimize tuning parameters of a learning machine
by looking at how they change the results on an independent test set. Unfortunately, once
one has adjusted the parameter in this way, the test set is not independent anymore. This
is identical to the corresponding problem in training on the training set: once we have
chosen the function to minimize the training error, the latter no longer provides an unbiased
estimate of the test error. Overfitting occurs much faster on the training set, however, than
it does on the test set. This is usually due to the fact that the number of tuning parameters
of a learning machine is much smaller than the total number of parameters, and thus the
capacity tends to be smaller. For instance, an SVM for pattern recognition typically has two
tuning parameters, and optimizes $m$ weight parameters (for a training set size of $m$). See
also Problem 5.3 and [461].


5.2 The Law of Large Numbers


Let us step back and try to look at the problem from a slightly different angle.
Consider the case of pattern recognition using the misclassification loss function.
Given a fixed function $f$, then for each example, the loss $\xi_i := \frac{1}{2} |f(x_i) - y_i|$ is either
0 or 1 (provided we have a $\pm 1$-valued function $f$), and all examples are drawn
independently. In the language of probability theory, we are faced with Bernoulli
trials. The $\xi_1, \ldots, \xi_m$ are independently sampled from a random variable

$$\xi := \frac{1}{2} |f(x) - y|. \tag{5.5}$$


A famous inequality due to Chernoff [107] characterizes how the empirical mean
$\frac{1}{m} \sum_{i=1}^{m} \xi_i$ converges to the expected value (or expectation) of $\xi$, denoted by $E(\xi)$:

$$P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E(\xi) \right| \geq \varepsilon \right\} \leq 2 \exp(-2 m \varepsilon^2). \tag{5.6}$$

Note that the $P$ refers to the probability of getting a sample $\xi_1, \ldots, \xi_m$ with the
property $|\frac{1}{m} \sum_{i=1}^{m} \xi_i - E(\xi)| \geq \varepsilon$. Mathematically speaking, $P$ strictly refers to a
so-called product measure (cf. (B.11)). We will presently avoid further mathematical
detail; more information can be found in Appendix B.

In some instances, we will use a more general bound, due to Hoeffding
(Theorem 5.1). Presently, we formulate and prove a special case of the Hoeffding bound,
which implies (5.6). Note that in the following statement, the $\xi_i$ are no longer
restricted to take values in $\{0, 1\}$.


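Theorem 5.1 (Hoeffding [244]) Let $\xi_i$, $i \in [m]$ be $m$ independent instances of a bounded
random variable $\xi$, with values in $[a, b]$. Denote their average by $Q_m = \frac{1}{m} \sum_i \xi_i$. Then for
any $\varepsilon > 0$,

$$\left. \begin{array}{l} P\{Q_m - E(\xi) \geq \varepsilon\} \\ P\{E(\xi) - Q_m \geq \varepsilon\} \end{array} \right\} \leq \exp\left( -\frac{2 m \varepsilon^2}{(b - a)^2} \right). \tag{5.7}$$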
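
The statement of Theorem 5.1 can be checked by simulation. The sketch below assumes that $\xi$ is uniformly distributed on $[a, b]$ (an arbitrary choice; any bounded distribution would do) and compares the observed frequency of large one-sided deviations with the right hand side of (5.7).

```python
import math
import random

def hoeffding_bound(m, eps, a=0.0, b=1.0):
    """One-sided bound exp(-2 m eps^2 / (b - a)^2) from (5.7)."""
    return math.exp(-2.0 * m * eps**2 / (b - a) ** 2)

def deviation_frequency(m, eps, trials, a=0.0, b=1.0):
    """Fraction of trials in which Q_m - E(xi) >= eps, for xi uniform on [a, b]."""
    mean = 0.5 * (a + b)
    hits = 0
    for _ in range(trials):
        q_m = sum(random.uniform(a, b) for _ in range(m)) / m
        if q_m - mean >= eps:
            hits += 1
    return hits / trials

random.seed(0)
m, eps = 50, 0.1
freq = deviation_frequency(m, eps, trials=5000)
bound = hoeffding_bound(m, eps)
print("observed frequency:", freq, "Hoeffding bound:", bound)
```

The bound holds for every bounded distribution on $[a, b]$, so for a particular distribution such as the uniform one it is typically far from tight.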


The proof is carried out by using a technique commonly known as Chernoff’s 
bounding method [107]. The proof technique is widely applicable, and generates 
bounds such as Bernstein’s inequality [44] (exponential bounds based on the 
variance of random variables), as well as concentration-of-measure inequalities 
(see, e.g., [356, 66]). Readers not interested in the technical details underlying laws 
of large numbers may want to skip the following discussion. 

We start with an auxiliary inequality. 


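Lemma 5.2 (Markov’s Inequality (e.g., [136])) Denote by $\xi$ a nonnegative random
variable with distribution $P$. Then for all $\lambda > 0$, the following inequality holds:

$$P\{\xi \geq \lambda E(\xi)\} \leq \frac{1}{\lambda}. \tag{5.8}$$

Proof Using the definition of $E(\xi)$, we have

$$E(\xi) = \int_0^\infty \xi \, dP(\xi) \geq \int_{\lambda E(\xi)}^\infty \xi \, dP(\xi) \geq \lambda E(\xi) \int_{\lambda E(\xi)}^\infty dP(\xi) = \lambda E(\xi) P\{\xi \geq \lambda E(\xi)\}.$$
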
Proof of Theorem 5.1. Without loss of generality, we assume that $E(\xi) = 0$
(otherwise simply define a random variable $\bar{\xi} := \xi - E(\xi)$ and use the latter in the proof).
Chernoff’s bounding method consists in transforming a random variable $\xi$ into
$\exp(s \xi)$ ($s > 0$), and applying Markov’s inequality to it. Depending on $\xi$, we can
obtain different bounds. In our case, we apply the method to the average $Q_m$ and use

$$P\{Q_m \geq \varepsilon\} = P\{\exp(s Q_m) \geq \exp(s \varepsilon)\} \leq e^{-s \varepsilon} E[\exp(s Q_m)] \tag{5.9}$$

$$= e^{-s \varepsilon} E\left[ \exp\left( \frac{s}{m} \sum_{i=1}^{m} \xi_i \right) \right] \leq e^{-s \varepsilon} \prod_{i=1}^{m} E\left[ \exp\left( \frac{s}{m} \xi_i \right) \right]. \tag{5.10}$$

In (5.10), we exploited the fact that for independent positive random variables
$E[\prod_i \xi_i] \leq \prod_i E[\xi_i]$. Since the inequality holds independent of the choice of $s$, we may
minimize over $s$ to obtain a bound that is as tight as possible. To this end, we transform
the expectation over $\exp(\frac{s}{m} \xi_i)$ into something more amenable. The derivation is
rather technical; thus we state without proof [244]:
$E[\exp(\frac{s}{m} \xi_i)] \leq \exp\left( \frac{s^2 (b - a)^2}{8 m^2} \right)$.
From this, we conclude that the optimal value of $s$ is given by $s = \frac{4 m \varepsilon}{(b - a)^2}$.
Substituting this value into the right hand side of (5.10) proves the bound. ∎
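
For readers who wish to verify the last step, the minimization over $s$ can be written out as follows. This is a routine calculation that combines (5.10) with the quoted bound on each of the $m$ factors.

$$P\{Q_m \geq \varepsilon\} \leq \exp\left( -s \varepsilon + \frac{s^2 (b - a)^2}{8 m} \right).$$

Setting the derivative of the exponent with respect to $s$ to zero gives

$$-\varepsilon + \frac{s (b - a)^2}{4 m} = 0 \quad \Longrightarrow \quad s = \frac{4 m \varepsilon}{(b - a)^2},$$

and inserting this value yields

$$-\frac{4 m \varepsilon^2}{(b - a)^2} + \frac{2 m \varepsilon^2}{(b - a)^2} = -\frac{2 m \varepsilon^2}{(b - a)^2},$$

which is the exponent in (5.7).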


Let us now return to (5.6). Substituting (5.5) into (5.6), we have a bound which 
states how likely it is that for a given function f, the empirical risk is close to the 
actual risk, 


$$P\{|R_{\mathrm{emp}}[f] - R[f]| \geq \varepsilon\} \leq 2 \exp(-2 m \varepsilon^2). \tag{5.11}$$


Using Hoeffding’s inequality, a similar bound can be given for the case of regres- 
sion estimation, provided the loss c(x, y, f(x)) is bounded. 
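
The bound (5.11) can be inverted to ask how many examples suffice for a single, fixed function. The sketch below solves $2 \exp(-2 m \varepsilon^2) \leq \delta$ for $m$; the helper names and the chosen values of $\varepsilon$ and $\delta$ are illustrative.

```python
import math

def chernoff_bound(m, eps):
    """Right hand side of (5.11): 2 exp(-2 m eps^2)."""
    return 2.0 * math.exp(-2.0 * m * eps**2)

def sample_size(eps, delta):
    """Smallest m with 2 exp(-2 m eps^2) <= delta, i.e. m >= ln(2/delta) / (2 eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

m = sample_size(eps=0.05, delta=0.05)
print("examples needed:", m, "bound at that m:", chernoff_bound(m, 0.05))
```

This calculation applies only to a function that was fixed before seeing the data; it says nothing yet about a function selected by minimizing the training error.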

For any fixed function, the training error thus provides an unbiased estimate 
of the test error. Moreover, the convergence (in probability) $R_{\mathrm{emp}}[f] \to R[f]$ as
$m \to \infty$ is exponentially fast in the number of training examples.² Although this
sounds just about as good as we could possibly have hoped, there is one caveat: 
a crucial property of both the Chernoff and the Hoeffding bound is that they are 
probabilistic in nature. They state that the probability of a large deviation between 
test error and training error of f is small; the larger the sample size m, the smaller 
the probability. Granted, they do not rule out the presence of cases where the 
deviation is large, and our learning machine will have many functions that it can 
implement. Could there be a function for which things go wrong? It appears that
we would be very unlucky for this to occur precisely for the function $f$ chosen by
empirical risk minimization.

At first sight, it seems that empirical risk minimization should work, in
contradiction to our lengthy explanation in the last section, arguing that we have
to do more than that. What is the catch?

2. Convergence in probability, denoted as $|R_{\mathrm{emp}}[f] - R[f]| \xrightarrow{P} 0$ as $m \to \infty$,
means that for all $\varepsilon > 0$, we have

$$\lim_{m \to \infty} P\{|R_{\mathrm{emp}}[f] - R[f]| > \varepsilon\} = 0.$$


5.3 When Does Learning Work: the Question of Consistency


It turns out that in the last section, we were too sloppy. When we find a function 
f by choosing it to minimize the training error, we are no longer looking at 
independent Bernoulli trials. We are actually choosing $f$ such that the mean of
the $\xi_i$ is as small as possible. In this sense, we are actively looking for the worst
case, for a function which is very atypical, with respect to the average loss (i.e., the 
empirical risk) that it will produce. 
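
The effect of selecting the function with the smallest training error can be simulated directly. In the sketch below, every "function" is a hypothetical classifier that errs on each example independently with probability 1/2, so that all of them have a true risk of exactly 0.5; treating the functions as independent of each other is a simplification made for this illustration.

```python
import random

random.seed(0)
m, n_functions = 50, 1000

# Each entry is the empirical risk of one function on m examples.
# The true risk of every function is 0.5, regardless of the sample.
empirical_risks = [
    sum(random.random() < 0.5 for _ in range(m)) / m
    for _ in range(n_functions)
]

typical = sum(empirical_risks) / n_functions   # close to the true risk 0.5
selected = min(empirical_risks)                # what the selected function reports
print("average empirical risk:", typical)
print("empirical risk of the selected function:", selected)
```

For any single function the empirical risk is a reliable estimate, but the minimum over many functions is systematically too optimistic, even though no function in this simulation is better than chance.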

We should thus state more clearly what it is that we actually need for empirical 
risk minimization to work. This is best expressed in terms of a notion that
statisticians call consistency. It amounts to saying that as the number of examples $m$ tends
to infinity, we want the function $f^m$ that minimizes $R_{\mathrm{emp}}[f]$ (note that $f^m$ need not
be unique), to lead to a test error which converges to the lowest achievable value.
In other words, $f^m$ is asymptotically as good as whatever we could have done if
we were able to directly minimize $R[f]$ (which we cannot, as we do not even know
it). In addition, consistency requires that asymptotically, the training and the test
error of $f^m$ be identical.³

It turns out that without restricting the set of admissible functions, empirical risk 
minimization is not consistent. The main insight of VC (Vapnik-Chervonenkis) 
theory is that actually, the worst case over all functions that the learning machine 
can implement determines the consistency of empirical risk minimization. In other 
words, we need a version of the law of large numbers which is uniform over all 
functions that the learning machine can implement. 


5.4 Uniform Convergence and Consistency 


The present section will explain how consistency can be characterized by a uniform
convergence condition on the set of functions $\mathcal{F}$ that the learning machine
can implement. Figure 5.2 gives a simplified depiction of the question of consistency.
Both the empirical risk and the actual risk are drawn as functions of f. For


3. We refrain from giving a more formal definition of consistency, the reason being that 
there are some caveats to this classical definition of consistency; these would necessitate a 
discussion leading us away from the main thread of the argument. For the precise definition 
of the required notion of “nontrivial consistency,” see [561]. 
Figure 5.2 Simplified depiction of the convergence of empirical risk to actual risk. The
x-axis gives a one-dimensional representation of the function class; the y-axis denotes the risk
(error). For each fixed function f, the law of large numbers tells us that as the sample size
goes to infinity, the empirical risk $R_{\mathrm{emp}}[f]$ converges towards the true risk $R[f]$ (indicated
by the downward arrow). This does not imply, however, that in the limit of infinite sample
sizes, the minimizer of the empirical risk, $f^m$, will lead to a value of the risk that is as good as
the best attainable risk, $R[f^{\mathrm{opt}}]$ (consistency). For the latter to be true, we require convergence
of $R_{\mathrm{emp}}[f]$ towards $R[f]$ to be uniform over all functions that the learning machine can
implement (see text).


simplicity, we have summarized all possible functions f by a single axis of the 
plot. Empirical risk minimization consists in picking the f that yields the minimal 
value of Remp. If it is consistent, then the minimum of Remp converges to that of R 
in probability. Let us denote the minimizer of R by $f^{\mathrm{opt}}$, satisfying


$R[f] - R[f^{\mathrm{opt}}] \ge 0$ (5.12)


for all $f \in \mathcal{F}$. This is the optimal choice that we could make, given complete
knowledge of the distribution P.⁴ Similarly, since $f^m$ minimizes the empirical risk,
we have


$R_{\mathrm{emp}}[f] - R_{\mathrm{emp}}[f^m] \ge 0$, (5.13)


for all $f \in \mathcal{F}$. Being true for all $f \in \mathcal{F}$, (5.12) and (5.13) hold in particular for $f^m$ and
$f^{\mathrm{opt}}$. If we substitute the former into (5.12) and the latter into (5.13), we obtain


$R[f^m] - R[f^{\mathrm{opt}}] \ge 0$, (5.14)
and
$R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R_{\mathrm{emp}}[f^m] \ge 0$. (5.15)


4. As with $f^m$, $f^{\mathrm{opt}}$ need not be unique.
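The sum of these two inequalities satisfies


$$\begin{aligned}
0 &\le R[f^m] - R[f^{\mathrm{opt}}] + R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R_{\mathrm{emp}}[f^m] \\
&= R[f^m] - R_{\mathrm{emp}}[f^m] + R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R[f^{\mathrm{opt}}] \\
&\le \sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) + R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R[f^{\mathrm{opt}}]. \qquad (5.16)
\end{aligned}$$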


Let us first consider the second half of the right hand side. Due to the law of large 
numbers, we have convergence in probability, i.e., for all $\epsilon > 0$,


$|R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R[f^{\mathrm{opt}}]| \xrightarrow{P} 0$ as $m \to \infty$. (5.17)


This holds true since $f^{\mathrm{opt}}$ is a fixed function, which is independent of the training
sample (see (5.11)).

The important conclusion is that if the empirical risk converges to the actual risk 
one-sided uniformly, over all functions that the learning machine can implement, 


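$\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) \xrightarrow{P} 0$ as $m \to \infty$, (5.18)


then the left hand sides of (5.14) and (5.15) will likewise converge to 0:

$R[f^m] - R[f^{\mathrm{opt}}] \xrightarrow{P} 0$, (5.19)
$R_{\mathrm{emp}}[f^{\mathrm{opt}}] - R_{\mathrm{emp}}[f^m] \xrightarrow{P} 0$. (5.20)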


As argued above, (5.17) is not always true for $f^m$, since $f^m$ is chosen to minimize
$R_{\mathrm{emp}}$, and thus depends on the sample. Assuming that (5.18) holds true, however,
then (5.19) and (5.20) imply that in the limit, $R[f^m]$ cannot be larger than $R_{\mathrm{emp}}[f^m]$.
One-sided uniform convergence on $\mathcal{F}$ is thus a sufficient condition for consistency
of the empirical risk minimization over $\mathcal{F}$.⁵

What about the other way round? Is one-sided uniform convergence also a 
necessary condition? Part of the mathematical beauty of VC theory lies in the 
fact that this is the case. We cannot go into the necessary details to prove this 
[571, 561, 562], and only state the main result. Note that this theorem uses the 
notion of nontrivial consistency that we already mentioned briefly in footnote 3. 
In a nutshell, this concept requires that the induction principle be consistent even 
after the “best” functions have been removed. Nontrivial consistency thus rules 
out, for instance, the case in which the problem is trivial, due to the existence of a 
function which uniformly does better than all other functions. To understand this, 
assume that there exists such a function. Since this function is uniformly better 
than all others, we can already select this function (using ERM) from one (arbitrary) 
data point. Hence the method would be trivially consistent, no matter what the 


5. Note that the onesidedness of the convergence comes from the fact that we only require 
consistency of empirical risk minimization. If we required the same for empirical risk maxi- 
mization, then we would end up with standard uniform convergence, and the parentheses 
in (5.18) would be replaced with modulus signs. 
rest of the function class looks like. Having one function which gets picked as soon
as we have seen one data point would essentially void the inherently asymptotic 
notion of consistency. 


Theorem 5.3 (Vapnik & Chervonenkis (e.g., [562])) One-sided uniform convergence
in probability,

$\lim_{m \to \infty} P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = 0$, (5.21)


for all $\epsilon > 0$, is a necessary and sufficient condition for nontrivial consistency of
empirical risk minimization.


As explained above, consistency, and thus learning, crucially depends on the set 
of functions. In Section 5.1, we gave an example where we considered the set of all 
possible functions, and showed that learning was impossible. The dependence of 
learning on the set of functions has now returned in a different guise: the condition 
of uniform convergence will crucially depend on the set of functions for which it 
must hold. 

The abstract characterization in Theorem 5.3 of consistency as a uniform con- 
vergence property, whilst theoretically intriguing, is not all that useful in practice. 
We do not want to check some fairly abstract convergence property every time 
we want to use a learning machine. Therefore, we next address whether there are 
properties of learning machines, i.e., of sets of functions, which ensure uniform 
convergence of risks. 


5.5 How to Derive a VC Bound 


We now take a closer look at the subject of Theorem 5.3; the probability 


$P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\}$. (5.22)

We give a simplified account, drawing from the expositions of [561, 562, 415, 238]. 
We do not aim to describe or even develop the theory to the extent that would 
be necessary to give precise bounds for SVMs, say. Instead, our goal will be to 
convey central insights rather than technical details. For more complete treatments 
geared specifically towards SVMs, cf. [562, 491, 24]. We focus on the case of pattern 
recognition; that is, on functions taking values in $\{\pm 1\}$.
Two tricks are needed along the way: the union bound and the method of sym- 
metrization by a ghost sample. 


5.5.1 The Union Bound 


Suppose the set $\mathcal{F}$ consists of two functions, $f_1$ and $f_2$. In this case, uniform
convergence of risk trivially follows from the law of large numbers, which holds
for each of the two. To see this, let


$C_\epsilon^i := \{(x_1, y_1), \dots, (x_m, y_m) \mid (R[f_i] - R_{\mathrm{emp}}[f_i]) > \epsilon\}$ (5.23)


denote the set of samples for which the risks of $f_i$ differ by more than $\epsilon$. Then, by
definition, we have 


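$P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = P(C_\epsilon^1 \cup C_\epsilon^2)$. (5.24)


The latter, however, can be rewritten as

$P(C_\epsilon^1 \cup C_\epsilon^2) = P(C_\epsilon^1) + P(C_\epsilon^2) - P(C_\epsilon^1 \cap C_\epsilon^2) \le P(C_\epsilon^1) + P(C_\epsilon^2)$, (5.25)


where the last inequality follows from the fact that P is nonnegative. Similarly, if
$\mathcal{F} = \{f_1, \dots, f_n\}$, we have


$P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = P(C_\epsilon^1 \cup \dots \cup C_\epsilon^n) \le \sum_{i=1}^n P(C_\epsilon^i)$. (5.26)
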
This inequality is called the union bound. As it is a crucial step in the derivation 
of risk bounds, it is worthwhile to emphasize that it becomes an equality if and 
only if all the events involved are disjoint. In practice, this is rarely the case, and 
we therefore lose a lot when applying (5.26). It is a step with a large “slack.” 
Nevertheless, when F is finite, we may simply apply the law of large numbers 
(5.11) for each individual $P(C_\epsilon^i)$, and the sum in (5.26) then leads to a constant factor
n on the right hand side of the bound; it does not change the exponentially
fast convergence of the empirical risk towards the actual risk. In the next section, 
we describe an ingenious trick used by Vapnik and Chervonenkis, to reduce the 
infinite case to the finite one. It consists of introducing what is sometimes called a 
ghost sample. 
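
To get a feeling for the slack in the union bound, one can compare it with a simulated probability for a small finite class. The sketch below rests on several assumptions that are not in the text: the true risks in `true_risks` are hypothetical, each term $P(C_\epsilon^i)$ is bounded by the one-sided Hoeffding estimate $\exp(-2m\epsilon^2)$, and, for simplicity, the losses of different functions are drawn independently (the union bound itself does not require this).

```python
import math
import random

def union_bound(n, m, eps):
    """Right-hand side of (5.26), each term bounded by exp(-2 m eps^2)."""
    return n * math.exp(-2.0 * m * eps ** 2)

def simulate_sup_deviation(true_risks, m, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P{ max_i (R[f_i] - R_emp[f_i]) > eps }."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        worst = -1.0
        for p in true_risks:
            # The losses of a fixed function are iid Bernoulli(p) trials.
            emp = sum(rng.random() < p for _ in range(m)) / m
            worst = max(worst, p - emp)
        hits += worst > eps
    return hits / trials

true_risks = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical values of R[f_i]
m, eps = 100, 0.15
freq = simulate_sup_deviation(true_risks, m, eps)
bound = union_bound(len(true_risks), m, eps)
print(f"simulated probability: {freq:.4f}   union bound: {bound:.4f}")
```

The simulated frequency should lie well below the bound, which reflects both the slack of the union bound and that of the exponential estimate for each individual term.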


5.5.2 Symmetrization 


The central observation in this section is that we can bound (5.22) in terms of 
a probability of an event referring to a finite function class. Note first that the 
empirical risk term in (5.22) effectively refers only to a finite function class: for 
any given training sample of m points $x_1, \dots, x_m$, the functions of $\mathcal{F}$ can take at
most $2^m$ different values $y_1, \dots, y_m$ (recall that the $y_i$ take values only in $\{\pm 1\}$).
In addition, the probability that the empirical risk differs from the actual risk by
more than $\epsilon$ can be bounded by twice the probability that it differs from the
empirical risk on a second sample of size m by more than $\epsilon/2$.


Lemma 5.4 (Symmetrization (Vapnik & Chervonenkis) (e.g. [559])) For $m\epsilon^2 \ge 2$,
we have


$P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le 2 P\{\sup_{f \in \mathcal{F}} (R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \epsilon/2\}$. (5.27)


Here, the first P refers to the distribution of iid samples of size m, while the second one
refers to iid samples of size 2m. In the latter case, $R_{\mathrm{emp}}$ measures the loss on the first half
of the sample, and $R'_{\mathrm{emp}}$ on the second half.



Although we do not prove this result, it should be fairly plausible: if the empirical 
error rates on two independent m-samples are close to each other, then they should 
also be close to the true error rate. 


5.5.3 The Shattering Coefficient 


The main result of Lemma 5.4 is that it implies, for the purpose of bounding (5.22),
that the function class $\mathcal{F}$ is effectively finite: restricted to the 2m points appearing
on the right hand side of (5.27), it has at most $2^{2m}$ elements. This is because only
the outputs of the functions on the patterns of the sample count, and there are
2m patterns with two possible outputs, $\pm 1$. The number of effectively different
functions can be smaller than $2^{2m}$, however; and for our purposes, this is the case
that will turn out to be interesting.

Let $Z_{2m} := ((x_1, y_1), \dots, (x_{2m}, y_{2m}))$ be the given 2m-sample. Denote by $N(\mathcal{F}, Z_{2m})$
the cardinality of $\mathcal{F}$ when restricted to $\{x_1, \dots, x_{2m}\}$, that is, the number of functions
from $\mathcal{F}$ that can be distinguished from their values on $\{x_1, \dots, x_{2m}\}$. Let us,
moreover, denote the maximum (over all possible choices of a 2m-sample) number
of functions that can be distinguished in this way as $N(\mathcal{F}, 2m)$.

The function $N(\mathcal{F}, m)$ is referred to as the shattering coefficient, or in the more general
case of regression estimation, the covering number of $\mathcal{F}$.⁶ In the case of pattern
recognition, which is what we are currently looking at, $N(\mathcal{F}, m)$ has a particularly
simple interpretation: it is the number of different outputs $(y_1, \dots, y_m)$ that the
functions in $\mathcal{F}$ can achieve on samples of a given size.⁷ In other words, it simply
measures the number of ways that the function class can separate the patterns into two
classes. Whenever $N(\mathcal{F}, m) = 2^m$, all possible separations can be implemented by
functions of the class. In this case, the function class is said to shatter m points.
Note that this means that there exists a set of m patterns which can be separated in
all possible ways; it does not mean that this applies to all sets of m patterns.
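
For very simple function classes, the shattering coefficient can be computed by brute force. As an illustration, the following sketch counts the distinct output vectors of a hypothetical class that is not discussed in the text, the thresholds on the real line $f(x) = s \cdot \mathrm{sgn}(x - \theta)$ with $s \in \{\pm 1\}$. For this class the count does not depend on where the m distinct points lie, so it coincides with the maximum over samples.

```python
def restricted_labelings(points):
    """All distinct output vectors of f(x) = s * sgn(x - theta) on the points."""
    xs = sorted(points)
    # Only thresholds below, between, and above the points can give
    # different outputs, so it suffices to try one theta per gap.
    thetas = ([xs[0] - 1.0]
              + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
              + [xs[-1] + 1.0])
    labelings = set()
    for theta in thetas:
        for s in (+1, -1):
            labelings.add(tuple(s if x > theta else -s for x in points))
    return labelings

counts = {}
for m in range(1, 6):
    points = [float(i) for i in range(m)]  # m distinct points on the line
    counts[m] = len(restricted_labelings(points))
    status = "shattered" if counts[m] == 2 ** m else "not shattered"
    print(m, counts[m], 2 ** m, status)
```

The count equals $2^m$ only for the smallest sample sizes and grows linearly in m thereafter, which is the kind of behavior that will turn out to matter in the bounds derived next.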


5.5.4 Uniform Convergence Bounds 


Let us now take a closer look at the probability that for a 2m-sample $Z_{2m}$ drawn
iid from P, we get a one-sided uniform deviation larger than $\epsilon/2$ (cf. (5.27)),


$P\{\sup_{f \in \mathcal{F}} (R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \epsilon/2\}$. (5.28)



6. In regression estimation, the covering number also depends on the accuracy within 
which we are approximating the function class, and on the loss function used; see Sec- 
tion 12.4 for more details. 

7. Using the zero-one loss $c(x, y, f(x)) = \frac{1}{2}|f(x) - y| \in \{0, 1\}$, it also equals the number of
different loss vectors $(c(x_1, y_1, f(x_1)), \dots, c(x_m, y_m, f(x_m)))$.
The basic idea now is to pick a maximal set of functions $\{f_1, \dots, f_{N(\mathcal{F}, Z_{2m})}\}$ that
can be distinguished based on their values on $Z_{2m}$, then use the union bound, and
finally bound each term using the Chernoff inequality. However, the fact that the
$f_i$ depend on the sample $Z_{2m}$ will make things somewhat more complicated. To
deal with this, we have to introduce an auxiliary step of randomization, using a
uniform distribution over permutations $\sigma$ of the 2m-sample $Z_{2m}$.

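Let us denote the empirical risks on the two halves of the sample after the
permutation $\sigma$ by $R^{\sigma}_{\mathrm{emp}}[f]$ and $R'^{\sigma}_{\mathrm{emp}}[f]$, respectively. Since the 2m-sample is iid,
the permutation does not affect (5.28). We may thus instead consider


$P_{Z_{2m}, \sigma}\{\sup_{f \in \mathcal{F}} (R^{\sigma}_{\mathrm{emp}}[f] - R'^{\sigma}_{\mathrm{emp}}[f]) > \epsilon/2\}$, (5.29)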


where the subscripts of P were added to clarify what the distribution refers to. We 
next rewrite this as 


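$\int_{(\mathcal{X} \times \{\pm 1\})^{2m}} P_{\sigma | Z_{2m}}\{\sup_{f \in \mathcal{F}} (R^{\sigma}_{\mathrm{emp}}[f] - R'^{\sigma}_{\mathrm{emp}}[f]) > \epsilon/2 \mid Z_{2m}\} \, dP(Z_{2m})$. (5.30)

We can now express the event $C_\epsilon := \{\sigma \mid \sup_{f \in \mathcal{F}} (R^{\sigma}_{\mathrm{emp}}[f] - R'^{\sigma}_{\mathrm{emp}}[f]) > \epsilon/2\}$ as

$C_\epsilon = \bigcup_{n=1}^{N(\mathcal{F}, Z_{2m})} C_\epsilon(f_n)$, (5.31)


where the events $C_\epsilon(f_n) := \{\sigma \mid (R^{\sigma}_{\mathrm{emp}}[f_n] - R'^{\sigma}_{\mathrm{emp}}[f_n]) > \epsilon/2\}$ refer to individual
functions $f_n$ chosen such that $(\bigcup_n \{f_n\})|_{Z_{2m}} = \mathcal{F}|_{Z_{2m}}$. Note that the functions $f_n$
may be considered as fixed, since we have conditioned on $Z_{2m}$.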

We are now in a position to appeal to the classical law of large numbers. Our 
random experiment consists of drawing o from the uniform distribution over all 
permutations of 2m-samples. This turns our sequence of losses $\xi_i = \frac{1}{2}|f(x_i) - y_i|$
($i = 1, \dots, 2m$) into an iid sequence of independent Bernoulli trials. We then apply
a modified Chernoff inequality to bound the probability of each event $C_\epsilon(f_n)$. It
states that given a 2m-sample of Bernoulli trials, we have (see Problem 5.4)


$P\{\frac{1}{m}\sum_{i=1}^{m} \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i \ge \epsilon\} \le 2\exp(-\frac{m\epsilon^2}{2})$. (5.32)


For our present problem, we thus obtain


$P_{\sigma | Z_{2m}}(C_\epsilon(f_n)) \le 2\exp(-\frac{m\epsilon^2}{8})$, (5.33)


independent of $f_n$. We next use the union bound to get a bound on the probability
of the event $C_\epsilon$ defined in (5.31). We obtain a sum over $N(\mathcal{F}, Z_{2m})$ identical terms
of the form (5.33). Hence (5.30) (and (5.29)) can be bounded from above by


$\int_{(\mathcal{X} \times \{\pm 1\})^{2m}} N(\mathcal{F}, Z_{2m}) \cdot 2\exp(-\frac{m\epsilon^2}{8}) \, dP(Z_{2m}) = 2\,E[N(\mathcal{F}, Z_{2m})] \exp(-\frac{m\epsilon^2}{8})$, (5.34)

where the expectation is taken over the random drawing of $Z_{2m}$. The last step is to
combine this with Lemma 5.4, to obtain


$P\{\sup_{f \in \mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le 4\,E[N(\mathcal{F}, Z_{2m})] \exp(-\frac{m\epsilon^2}{8}) = 4\exp(\ln E[N(\mathcal{F}, Z_{2m})] - \frac{m\epsilon^2}{8})$. (5.35)

We conclude that provided $E[N(\mathcal{F}, Z_{2m})]$ does not grow exponentially in m (i.e.,
$\ln E[N(\mathcal{F}, Z_{2m})]$ grows sublinearly), it is actually possible to make nontrivial statements
about the test error of learning machines.

The above reasoning is essentially the VC style analysis. Similar bounds can 
be obtained using a strategy which is more common in the field of empirical 
processes, first proving that $\sup_{f}(R[f] - R_{\mathrm{emp}}[f])$ is concentrated around its mean
[554, 14]. 


5.5.5 Confidence Intervals 


It is sometimes useful to rewrite (5.35) such that we specify the probability with 
which we want the bound to hold, and then get the confidence interval, which 
tells us how close the risk should be to the empirical risk. This can be achieved by 
setting the right hand side of (5.35) equal to some $\delta > 0$, and then solving for $\epsilon$. As
a result, we get the statement that with a probability at least $1 - \delta$,


$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{8}{m}(\ln E[N(\mathcal{F}, Z_{2m})] + \ln\frac{4}{\delta})}$. (5.36)

Note that this bound holds independent of f; in particular, it holds for the function 
$f^m$ minimizing the empirical risk. This is not only a strength, but also a weakness
in the bound. It is a strength since many learning machines do not truly minimize
the empirical risk, and the bound thus holds for them, too. It is a weakness since by
taking into account more information on which function we are interested in, one 
could hope to get more accurate bounds. We will return to this issue in Section 12.1. 
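
The step from (5.35) to (5.36) is easy to check numerically. The following is a minimal sketch; the function names are our own, and the value used for $E[N(\mathcal{F}, Z_{2m})]$ is purely hypothetical, since in practice this quantity is usually unknown.

```python
import math

def deviation_probability_bound(expected_n, m, eps):
    """Right-hand side of (5.35): 4 E[N] exp(-m eps^2 / 8)."""
    return 4.0 * expected_n * math.exp(-m * eps ** 2 / 8.0)

def confidence_term(expected_n, m, delta):
    """Square-root term of (5.36), i.e. (5.35) set to delta and solved for eps."""
    return math.sqrt(8.0 / m * (math.log(expected_n) + math.log(4.0 / delta)))

expected_n = 1000.0  # hypothetical value of E[N(F, Z_2m)]
m, delta = 10000, 0.05
eps = confidence_term(expected_n, m, delta)
print(f"with probability at least {1 - delta}: R[f] <= R_emp[f] + {eps:.3f}")
# Plugging eps back into (5.35) recovers delta, up to rounding.
print(deviation_probability_bound(expected_n, m, eps))
```

Note that the confidence term shrinks only like $1/\sqrt{m}$, while the capacity enters logarithmically.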
Bounds like (5.36) can be used to justify induction principles different from the 
empirical risk minimization principle. Vapnik and Chervonenkis [569, 559] proposed
minimizing the right hand side of these bounds, rather than just the empirical
risk. The confidence term, in the present case, $\sqrt{\frac{8}{m}(\ln E[N(\mathcal{F}, Z_{2m})] + \ln\frac{4}{\delta})}$,
then ensures that the chosen function, denoted $f^*$, not only leads to a small risk,
but also comes from a function class with small capacity.

The capacity term is a property of the function class F, and not of any individ- 
ual function f. Thus, the bound cannot simply be minimized over choices of f. 
Instead, we introduce a so-called structure on F, and minimize over the choice of 
the structure. This leads to an induction principle called structural risk minimiza- 
tion. We leave out the technicalities involved [559, 136, 562]. The main idea is 
depicted in Figure 5.3. 
Figure 5.3 Graphical depiction of the structural risk minimization (SRM) induction principle.
The function class is decomposed into a nested sequence of subsets of increasing size
(and thus, of increasing capacity). The SRM principle picks a function $f^*$ which has small
training error, and comes from an element of the structure that has low capacity h, thus
minimizing a risk bound of type (5.36).


For practical purposes, we usually employ bounds of the type (5.36) as a guide- 
line for coming up with risk functionals (see Section 4.1). Often, the risk functionals 
form a compromise between quantities that should be minimized from a statistical 
point of view, and quantities that can be minimized efficiently (cf. Problem 5.7). 

There exists a large number of bounds similar to (5.35) and its alternative form 
(5.36). Differences occur in the constants, both in front of the exponential and in 
its exponent. The bounds also differ in the exponent of $\epsilon$; in some cases, by a
factor greater than 2. For instance, if a training error of zero is achievable, we can
use Bernstein's inequality instead of Chernoff's result, which leads to $\epsilon$ rather than
$\epsilon^2$. For further details, cf. [136, 562, 492, 238]. Finally, the bounds differ in the way
they measure capacity. So far, we have used covering numbers, but this is not the 
only method. 


5.5.6 The VC Dimension and Other Capacity Concepts 


So far, we have formulated the bounds in terms of the so-called annealed entropy
$\ln E[N(\mathcal{F}, Z_{2m})]$. This led to statements that depend on the distribution and thus
can take into account characteristics of the problem at hand. The downside is
that they are usually difficult to evaluate; moreover, in most problems, we do
not have knowledge of the underlying distribution. However, a number of different
capacity concepts, with different properties, can take the role of the term
$\ln E[N(\mathcal{F}, Z_{2m})]$ in (5.36).


■ Given an example $(x, y)$, $f \in \mathcal{F}$ causes a loss that we denote by $c(x, y, f(x)) := \frac{1}{2}|f(x) - y| \in \{0, 1\}$.
For a larger sample $(x_1, y_1), \dots, (x_m, y_m)$, the different functions
$f \in \mathcal{F}$ lead to a set of loss vectors $\xi_f = (c(x_1, y_1, f(x_1)), \dots, c(x_m, y_m, f(x_m)))$, whose
cardinality we denote by $N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))$. The VC entropy is defined as


$H_{\mathcal{F}}(m) = E[\ln N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))]$, (5.37)


where the expectation is taken over the random generation of the m-sample
$(x_1, y_1), \dots, (x_m, y_m)$ from P.
One can show [562] that the convergence


$\lim_{m \to \infty} H_{\mathcal{F}}(m)/m = 0$, (5.38)


is equivalent to uniform (two-sided) convergence of risk,

$\lim_{m \to \infty} P\{\sup_{f \in \mathcal{F}} |R[f] - R_{\mathrm{emp}}[f]| > \epsilon\} = 0$, (5.39)


for all $\epsilon > 0$. By Theorem 5.3, (5.39) thus implies consistency of empirical risk
minimization.

■ If we exchange the expectation E and the logarithm in (5.37), we obtain the
annealed entropy used above,


$H^{\mathrm{ann}}_{\mathcal{F}}(m) = \ln E[N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))]$. (5.40)


Since the logarithm is a concave function, the annealed entropy is an upper bound
on the VC entropy. Therefore, whenever the annealed entropy satisfies a condition
of the form (5.38), the same automatically holds for the VC entropy.

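One can show that the convergence

$\lim_{m \to \infty} H^{\mathrm{ann}}_{\mathcal{F}}(m)/m = 0$, (5.41)


implies exponentially fast convergence [561],

$P\{\sup_{f \in \mathcal{F}} |R[f] - R_{\mathrm{emp}}[f]| > \epsilon\} \le 4\exp(((H^{\mathrm{ann}}_{\mathcal{F}}(2m)/m) - \epsilon^2) \cdot m)$. (5.42)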


It has recently been proven that in fact (5.41) is not only sufficient, but also neces- 
sary for this [66]. 

■ We can obtain an upper bound on both entropies introduced so far, by taking a
supremum over all possible samples, instead of the expectation. This leads to the
growth function,


$G_{\mathcal{F}}(m) = \max_{(x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}} \ln N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))$. (5.43)


Note that by definition, the growth function is the logarithm of the shattering
coefficient, $G_{\mathcal{F}}(m) = \ln N(\mathcal{F}, m)$.
The convergence


$\lim_{m \to \infty} G_{\mathcal{F}}(m)/m = 0$, (5.44)


is necessary and sufficient for exponentially fast convergence of risk for all underlying
distributions P.
■ The next step will be to summarize the main behavior of the growth function
with a single number. If $\mathcal{F}$ is as rich as possible, so that for any sample of size m,
the points can be chosen such that by using functions of the learning machine, they
can be separated in all $2^m$ possible ways (i.e., they can be shattered), then


$G_{\mathcal{F}}(m) = m \cdot \ln(2)$. (5.45)


In this case, the convergence (5.44) does not take place, and learning will not 
generally be successful. What about the other case? Vapnik and Chervonenkis 
[567, 568] showed that either (5.45) holds true for all m, or there exists some 
maximal m for which (5.45) is satisfied. This number is called the VC dimension 
and is denoted by h. If the maximum does not exist, the VC dimension is said to 
be infinite. 

By construction, the VC dimension is thus the maximal number of points which 
can be shattered by functions in F. It is possible to prove that for m > h [568], 


$G_{\mathcal{F}}(m) \le h(\ln\frac{m}{h} + 1)$. (5.46)


This means that up to m = h, the growth function increases linearly with the 
sample size. Thereafter, it only increases logarithmically, i.e., much more slowly. 
This is the regime where learning can succeed. 
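
The two regimes can be tabulated directly. In the following sketch, the VC dimension `h = 10` is a hypothetical value, and `growth_bound` simply combines (5.45) for $m \le h$ with the upper bound (5.46) for $m > h$.

```python
import math

def growth_bound(m, h):
    """m ln 2 up to m = h (5.45); the upper bound (5.46) for m > h."""
    if m <= h:
        return m * math.log(2.0)
    return h * (math.log(m / h) + 1.0)

h = 10  # hypothetical VC dimension
for m in (5, 10, 100, 1000, 10000):
    g = growth_bound(m, h)
    # The ratio g / m is what has to vanish in condition (5.44).
    print(f"m = {m:6d}   bound = {g:8.2f}   bound / m = {g / m:.4f}")
```

For $m \le h$, the ratio stays at $\ln 2$; beyond that, it decays towards zero, as required by (5.44).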


Although we do not make use of it in the present chapter, it is worthwhile to
also introduce the VC dimension of a class of real-valued functions $\{f_w \mid w \in \Lambda\}$ at this
stage. It is defined to equal the VC dimension of the class of indicator functions


$\{\mathrm{sgn}(f_w - \beta) \mid w \in \Lambda, \beta \in (\inf_x f_w(x), \sup_x f_w(x))\}$. (5.47)

In summary, we get a succession of capacity concepts,

$H_{\mathcal{F}}(m) \le H^{\mathrm{ann}}_{\mathcal{F}}(m) \le G_{\mathcal{F}}(m) \le h(\ln\frac{m}{h} + 1)$. (5.48)


From left to right, these become less precise. The entropies on the left are 
distribution-dependent, but rather difficult to evaluate (see, e.g., [430, 391]). The 
growth function and VC dimension are distribution-independent. This is less ac- 
curate, and does not always capture the essence of a given problem, which might 
have a much more benign distribution than the worst case; on the other hand, we 
want the learning machine to work for all distributions. If we knew the distribu- 
tion beforehand, then we would not need a learning machine anymore. 

Let us look at a simple example of the VC dimension. As a function class, we
consider hyperplanes in $\mathbb{R}^2$, i.e.,


$f(x) = \mathrm{sgn}(a + b[x]_1 + c[x]_2)$, with parameters $a, b, c \in \mathbb{R}$. (5.49)


Suppose we are given three points $x_1, x_2, x_3$ which are not collinear. No matter
how they are labelled (that is, independent of our choice of $y_1, y_2, y_3 \in \{\pm 1\}$), we
can always find parameters $a, b, c \in \mathbb{R}$ such that $f(x_i) = y_i$ for all i (see Figure 1.4 in
the introduction). In other words, there exist three points that we can shatter. This 
shows that the VC dimension of the set of hyperplanes in $\mathbb{R}^2$ satisfies $h \ge 3$. On the
other hand, we can never shatter four points. It follows from simple geometry that
given any four points, there is always a set of labels such that we cannot realize the
corresponding classification. Therefore, the VC dimension is h = 3. More generally,
for hyperplanes in $\mathbb{R}^N$, the VC dimension can be shown to be h = N + 1. For a
formal derivation of this result, as well as of other examples, see [523].
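
The example can be explored numerically by enumerating all labelings of a small point set. The sketch below is not a proof: as an assumption of ours, it searches for the parameters a, b, c of (5.49) with the perceptron rule, so a failure only means that no separating line was found within `max_epochs` passes. The two point sets are hypothetical.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Search for (a, b, c) with sgn(a + b*x1 + c*x2) = y on all points."""
    a = b = c = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (a + b * x1 + c * x2) <= 0:
                # Perceptron update on a misclassified point.
                a, b, c = a + y, b + y * x1, c + y * x2
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def count_separable_labelings(points):
    """Number of labelings in {-1, +1}^r for which a separating line was found."""
    return sum(separable(points, labels)
               for labels in product((-1, +1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]             # not collinear
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # corners of a square
n_three = count_separable_labelings(three)
n_four = count_separable_labelings(four)
print(f"three points: {n_three} of {2 ** 3} labelings realized")
print(f"four points:  {n_four} of {2 ** 4} labelings realized")
```

For the four corners of the square, the two labelings that assign equal labels to diagonally opposite corners cannot be realized by any line, which is one instance of the geometric argument above.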

How does this fit together with the fact that SVMs can be shown to correspond 
to hyperplanes in feature spaces of possibly infinite dimension? The crucial point 
is that SVMs correspond to large margin hyperplanes. Once the margin enters, the 
capacity can be much smaller than the above general VC dimension of hyper- 
planes. For simplicity, we consider the case of hyperplanes containing the origin. 


Theorem 5.5 (Vapnik [559]) Consider hyperplanes $\langle w, x \rangle = 0$, where w is normalized
such that they are in canonical form w.r.t. a set of points $X^* = \{x_1, \dots, x_r\}$; i.e.,

$\min_{i=1,\dots,r} |\langle w, x_i \rangle| = 1$. (5.50)


The set of decision functions $f_w(x) = \mathrm{sgn}\langle x, w \rangle$ defined on $X^*$, and satisfying the constraint
$\|w\| \le \Lambda$, has a VC dimension satisfying


$h \le R^2 \Lambda^2$. (5.51)

Here, R is the radius of the smallest sphere centered at the origin and containing $X^*$.
Before we give a proof, several remarks are in order. 


■ The theorem states that we can control the VC dimension irrespective of the
dimension of the space by controlling the length of the weight vector $\|w\|$. Note,
however, that this needs to be done a priori, by choosing a value for $\Lambda$. It therefore
does not strictly motivate what we will later see in SVMs, where ||w|| is minimized 
in order to control the capacity. Detailed treatments can be found in the work of 
Shawe-Taylor et al. [491, 24, 125]. 


■ There exists a similar result for the case where R is the radius of the smallest
sphere (not necessarily centered at the origin) enclosing the data, and where we
allow for the possibility that the hyperplanes have a nonzero offset b [562]. In this
case, we give a simple visualization in Figure 5.4, which shows it is plausible
that enforcing a large margin amounts to reducing the VC dimension.


■ Note that the theorem talks about functions defined on $X^*$. To extend it to the
case where the functions are defined on all of the input domain X, it is best to state 
it for the fat shattering dimension. For details, see [24]. 


The proof [24, 222, 559] is somewhat technical, and can be skipped if desired. 


Proof Let us assume that $x_1, \dots, x_r$ are shattered by canonical hyperplanes with
$\|w\| \le \Lambda$. Consequently, for all $y_1, \dots, y_r \in \{\pm 1\}$, there exists a w with $\|w\| \le \Lambda$,
such that


$y_i \langle w, x_i \rangle \ge 1$ for all $i = 1, \dots, r$. (5.52)
Figure 5.4 Simple visualization of the fact that enforcing a large margin of separation
amounts to limiting the VC dimension. Assume that the data points are contained in a ball
of radius R (cf. Theorem 5.5). Using hyperplanes with margin $\gamma_1$, it is possible to separate
three points in all possible ways. Using hyperplanes with the larger margin $\gamma_2$, this is only
possible for two points, hence the VC dimension in that case is two rather than three.


The proof proceeds in two steps. In the first part, we prove that the more points we
want to shatter (5.52), the larger $\|\sum_{i=1}^r y_i x_i\|$ must be. In the second part, we prove
that we can upper bound the size of $\|\sum_{i=1}^r y_i x_i\|$ in terms of R. Combining the two
gives the desired condition, which tells us the maximum number of points we can 
shatter. 

Summing (5.52) over $i = 1, \dots, r$ yields


$\langle w, \sum_{i=1}^r y_i x_i \rangle \ge r$. (5.53)


By the Cauchy-Schwarz inequality, on the other hand, we have


$\langle w, \sum_{i=1}^r y_i x_i \rangle \le \|w\| \, \|\sum_{i=1}^r y_i x_i\| \le \Lambda \|\sum_{i=1}^r y_i x_i\|$. (5.54)

Here, the second inequality follows from $\|w\| \le \Lambda$.
Combining (5.53) and (5.54), we get the desired lower bound,

$\frac{r}{\Lambda} \le \|\sum_{i=1}^r y_i x_i\|$. (5.55)

We now move on to the second part. Let us consider independent random labels
$y_i \in \{\pm 1\}$ which are uniformly distributed, sometimes called Rademacher variables.
Let E denote the expectation over the choice of the labels. Exploiting the linearity
of E, we have


$$\begin{aligned}
E\|\sum_{i=1}^r y_i x_i\|^2 &= \sum_{i=1}^r E\langle y_i x_i, \sum_{j=1}^r y_j x_j \rangle \\
&= \sum_{i=1}^r E\langle y_i x_i, (\sum_{j \ne i} y_j x_j) + y_i x_i \rangle \\
&= \sum_{i=1}^r ((\sum_{j \ne i} E\langle y_i x_i, y_j x_j \rangle) + E\langle y_i x_i, y_i x_i \rangle) \\
&= \sum_{i=1}^r E\|y_i x_i\|^2, \qquad (5.56)
\end{aligned}$$


where the last equality follows from the fact that the Rademacher variables have
zero mean and are independent. Exploiting the fact that $\|y_i x_i\| = \|x_i\| \le R$, we get


$E\|\sum_{i=1}^r y_i x_i\|^2 \le r R^2$. (5.57)


Since this is true for the expectation over the random choice of the labels, there 
must be at least one set of labels for which it also holds true. We have so far made 
no restrictions on the labels, hence we may now use this specific set of labels. This 
leads to the desired upper bound, 


$\|\sum_{i=1}^r y_i x_i\|^2 \le r R^2$. (5.58)


Combining the upper bound with the lower bound (5.55), we get


$\frac{r^2}{\Lambda^2} \le r R^2$; (5.59)

hence,

$r \le R^2 \Lambda^2$. (5.60)


In other words, if the r points are shattered by a canonical hyperplane satisfying 
the assumptions we have made, then r is constrained by (5.60). The VC dimension 
h also satisfies (5.60), since it corresponds to the maximum number of points that 
can be shattered. ∎
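
The key identity (5.56) of the second part can be verified exactly for small r by enumerating all $2^r$ labelings. The following sketch does this for a hypothetical set of four points in the plane; the helper names are our own.

```python
from itertools import product

def sq_norm(v):
    """Squared Euclidean norm of a vector given as a tuple or list."""
    return sum(c * c for c in v)

def expected_sq_norm_of_signed_sum(points):
    """Average of ||sum_i y_i x_i||^2 over all labelings y in {-1, +1}^r."""
    r, dim = len(points), len(points[0])
    total = 0.0
    for labels in product((-1, +1), repeat=r):
        signed_sum = [sum(y * x[k] for y, x in zip(labels, points))
                      for k in range(dim)]
        total += sq_norm(signed_sum)
    return total / 2 ** r

points = [(1.0, 0.0), (0.5, -0.5), (0.0, 2.0), (-1.0, 1.0)]  # hypothetical sample
lhs = expected_sq_norm_of_signed_sum(points)   # left-hand side of (5.56)
rhs = sum(sq_norm(x) for x in points)          # right-hand side of (5.56)
R2 = max(sq_norm(x) for x in points)           # squared radius R^2
print(lhs, rhs, len(points) * R2)
```

Averaging over all labelings removes the cross terms $\langle y_i x_i, y_j x_j \rangle$ with $i \ne j$, so the first two printed values agree, and both are bounded by $rR^2$ as in (5.57).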


In the next section, we give an application of this theorem. Readers only interested 
in the theoretical background of learning theory may want to skip this section. 


5.6 A Model Selection Example 


In the following example, taken from [470], we use a bound of the form (5.36) 
to predict which kernel would perform best on a character recognition problem 
(USPS set, see Section A.1). Since the problem is essentially separable, we disre- 
gard the empirical risk term in the bound, and choose the parameters of a polyno- 
mial kernel by minimizing the second term. Note that the second term is a mono- 
tonic function of the capacity. As a capacity measure, we use the upper bound on 
the VC dimension described in Theorem 5.5, which in turn is an upper bound on 
the logarithm of the covering number that appears in (5.36) (by the arguments put 
forward in Section 5.5.6). 
Figure 5.5 Average VC dimension (solid), and total number of test errors, of ten
two-class-classifiers (dotted) with polynomial degrees 2 through 7, trained on the USPS set of
handwritten digits. The baseline 174 on the error scale corresponds to the total number
of test errors of the ten best binary classifiers, chosen from degrees 2 through 7. The graph
shows that for this problem, which can essentially be solved with zero training error for all
degrees greater than 1, the VC dimension allows us to predict that degree 4 yields the best
overall performance of the two-class-classifier on the test set (from [470, 467]).


We employ a version of Theorem 5.5, which uses the radius of the smallest 
sphere containing the data in a feature space H associated with the kernel k [561]. 
The radius was computed by solving a quadratic program [470, 85] (cf. Section 8.3). 
We formulate the problem as follows: 

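$$\begin{aligned} &\underset{R \ge 0,\; x^* \in \mathcal{H}}{\text{minimize}} && R^2 \\ &\text{subject to} && \|x_i - x^*\|^2 \le R^2, \end{aligned} \tag{5.61}$$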


where x* is the center of the sphere, and is found in the course of the optimization. 
Employing the tools of constrained optimization, as briefly described in Chapter 1 
(for details, see Chapter 6), we construct a Lagrangian, 


$$R^2 - \sum_{i=1}^{m} \lambda_i \left( R^2 - \|x_i - x^*\|^2 \right), \tag{5.62}$$

and compute the derivatives with respect to x* and R, to get 

$$x^* = \sum_{i=1}^{m} \lambda_i x_i \tag{5.63}$$


and the Wolfe dual problem: 


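$$\underset{\lambda \in \mathbb{R}^m}{\text{maximize}} \quad \sum_{i=1}^{m} \lambda_i \langle x_i, x_i \rangle - \sum_{i,j=1}^{m} \lambda_i \lambda_j \langle x_i, x_j \rangle, \tag{5.64}$$

$$\text{subject to} \quad \sum_{i=1}^{m} \lambda_i = 1, \quad \lambda_i \ge 0, \tag{5.65}$$

where $\lambda$ is the vector of all Lagrange multipliers $\lambda_i$, $i = 1, \ldots, m$. 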


As in the Support Vector algorithm, this problem has the property that the $x_i$ 
appear only in dot products, so we can again compute the dot products in feature 
space, replacing $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$ (where the arguments of $k$ belong to the 
input domain $\mathcal{X}$, and the $x_i$ in the dot products to the feature space $\mathcal{H}$). 

As Figure 5.5 shows, the VC dimension bound, using the radius R computed in 
this way, gives a rather good prediction of the error on an independent test set. 
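
To make the kernelized dual (5.64)-(5.65) concrete, the following is a minimal sketch in Python. It is not the quadratic programming routine referred to above: as a stand-in, it uses a simple Frank-Wolfe ascent over the simplex, which needs only the kernel matrix. The function name `enclosing_sphere`, the iteration count, and the toy data are assumptions made for illustration.

```python
def enclosing_sphere(K, iters=5000):
    """Frank-Wolfe ascent on the dual (5.64)-(5.65).

    K: m x m kernel matrix (list of lists), K[i][j] = k(x_i, x_j).
    Returns (R2, lam): the dual objective value, which approaches the
    squared radius from below, and the multipliers on the simplex.
    """
    m = len(K)
    lam = [1.0 / m] * m  # feasible start: lam_i >= 0, sum_i lam_i = 1

    def k_times(v):
        return [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]

    for _ in range(iters):
        Klam = k_times(lam)
        quad = sum(l * kl for l, kl in zip(lam, Klam))      # lam^T K lam
        grad = [K[i][i] - 2.0 * Klam[i] for i in range(m)]  # gradient of (5.64)
        i = max(range(m), key=lambda j: grad[j])            # best simplex vertex e_i
        slope = grad[i] - sum(g * l for g, l in zip(grad, lam))
        if slope <= 1e-12:                                  # no ascent direction left
            break
        curv = K[i][i] - 2.0 * Klam[i] + quad               # d^T K d for d = e_i - lam
        step = 1.0 if curv <= 0.0 else min(1.0, slope / (2.0 * curv))
        lam = [(1.0 - step) * l for l in lam]               # move toward e_i
        lam[i] += step

    Klam = k_times(lam)
    R2 = (sum(lam[i] * K[i][i] for i in range(m))
          - sum(l * kl for l, kl in zip(lam, Klam)))
    return R2, lam


# Toy data (assumed): linear kernel on three points in the plane. The
# smallest enclosing circle is centred at the origin and has radius 1.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5)]
K = [[a[0] * b[0] + a[1] * b[1] for b in points] for a in points]
R2, lam = enclosing_sphere(K)
print(round(R2, 3), [round(l, 3) for l in lam])
```

Because only entries of `K` are used, swapping in a polynomial kernel matrix changes nothing in the routine; the center is never formed explicitly and is available only through the expansion (5.63).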


5.7 Summary 


In this chapter, we introduced the main ideas of statistical learning theory. For 
learning processes utilizing empirical risk minimization to be successful, we need 
a version of the law of large numbers that holds uniformly over all functions the 
learning machine can implement. For this uniform law to hold true, the capacity 
of the set of functions that the learning machine can implement has to be “well- 
behaved.” We gave several capacity measures, such as the VC dimension, and 
illustrated how to derive bounds on the test error of a learning machine, in terms 
of the training error and the capacity. We have, moreover, shown how to bound 
the capacity of margin classifiers, a result which will later be used to motivate the 
Support Vector algorithm. Finally, we described an application in which a uniform 
convergence bound was used for model selection. 

Whilst this discussion of learning theory should be sufficient to understand 
most of the present book, we will revisit learning theory at a later stage. In Chap- 
ter 12, we will present some more advanced material, which applies to kernel 
learning machines. Specifically, we will introduce another class of generalization 
error bound, building on a concept of stability of algorithms minimizing regular- 
ized risk functionals. These bounds are proven using concentration-of-measure in- 
equalities, which are themselves generalizations of Chernoff and Hoeffding type 
bounds. In addition, we will discuss leave-one-out and PAC-Bayesian bounds. 


5.8 Problems 


5.1 (No Free Lunch in Kernel Choice ••) Discuss the relationship between the 
“no-free-lunch Theorem” and the statement that there is no free lunch in kernel choice. 


5.2 (Error Counting Estimate [136] •) Suppose you are given a test set with n elements 
to assess the accuracy of a trained classifier. Use the Chernoff bound to quantify the 
probability that the mean error on the test set differs from the true risk by more than $\varepsilon > 0$. 
Argue that the test set should be as large as possible, in order to get a reliable estimate of 
the performance of a classifier. 


5.3 (The Tainted Die ••) A con-artist wants to taint a die such that it does not generate 
any ‘6’ when cast. Yet he does not know exactly how. So he devises the following scheme: 
he makes some changes and subsequently rolls the die 20 times to check that no ‘6’ occurs. 
Unless pleased with the outcome, he changes more things and repeats the experiment. 

How long will it take on average, until, even with a perfect die, he will be convinced that 
he has a die that never generates a ‘6’? What is the probability that this already happens 
at the first trial? Can you improve the strategy such that he can be sure the die is ‘well’ 
tainted (hint: longer trials provide increased confidence)? 


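5.4 (Chernoff Bound for the Deviation of Empirical Means ••) Use (5.6) and the 
triangle inequality to prove that 

$$P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \right| \ge \varepsilon \right\} \le 4 \exp\left( -\frac{m \varepsilon^2}{2} \right). \tag{5.66}$$

Next, note that the bound (5.66) is symmetric in how it deals with the two halves of the 
sample. Therefore, since the two events 

$$\left\{ \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \ge \varepsilon \right\} \tag{5.67}$$

and 

$$\left\{ \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \le -\varepsilon \right\} \tag{5.68}$$

are disjoint, argue that (5.32) holds true. See also Corollary 6.34 below. 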


5.5 (Consistency and Uniform Convergence ••) Why can we not get a bound on the 
generalization error of a learning algorithm by applying (5.11) to the outcome of the 
algorithm? Argue that since we do not know in advance which function the learning 
algorithm returns, we need to consider the worst possible case, which leads to uniform 
convergence considerations. 

Speculate whether there could be restrictions on learning algorithms which imply that 
effectively, empirical risk minimization only leads to a subset of the set of all possible 
functions. Argue that this amounts to restricting the capacity. Consider as an example 
neural networks with back-propagation: if the training algorithm always returns a local 
minimum close to the starting point in weight space, then the network effectively does not 
explore the whole weight (i.e., function) space. 


5.6 (Confidence Interval and Uniform Convergence •) Derive (5.36) from (5.35). 


5.7 (Representer Algorithms for Minimizing VC Bounds ○○○) Construct kernel 
algorithms that are more closely aligned with VC bounds of the form (5.36). Hint: in the 
risk functional, replace the standard SV regularizer $\|w\|^2$ with the second term of (5.36), 
bounding the shattering coefficient with the VC dimension bound (Theorem 5.5). Use the 
representer theorem (Section 4.2) to argue that the minimizer takes the form of a kernel 
expansion in terms of the training examples. Find the optimal expansion coefficients by 
minimizing the modified risk functional over the choice of expansion coefficients. 


5.8 (Bounds in Terms of the VC Dimension •) From (5.35) and (5.36), derive bounds 
in terms of the growth function and the VC dimension, using the results of Section 5.5.6. 
Discuss the conditions under which they hold. 


5.9 (VC Theory and Decision Theory •••) (i) Discuss the relationship between minimax 
estimation (cf. footnote 7 in Chapter 1) and VC theory. Argue that the VC bounds can 
be made “worst case” over distributions by picking suitable capacity measures. However, 
they only bound the difference between empirical risk and true risk, thus they are only 
“worst case” for the variance term, not for the bias (or empirical risk). The minimization 
of an upper bound on the risk of the form (5.36) as performed in SRM is done in order to 
construct an induction principle rather than to make a minimax statement. Finally, note 
that the minimization is done with respect to a structure on the set of functions, while in 
the minimax paradigm one takes the minimum directly over (all) functions. 

(ii) Discuss the following folklore statement: “VC statisticians do not care about doing 
the optimal thing, as long as they can guarantee how well they are doing. Bayesians do not 
care how well they are doing, as long as they are doing the optimal thing.” 


5.10 (Overfitting on the Test Set •••) Consider a learning algorithm which has a free 
parameter C. Suppose you randomly pick n values $C_1, \ldots, C_n$, and for each $C_i$, you train 
your algorithm. At the end, you pick the value for C which did best on the test set. How 
would you expect your misjudgment of the true test error to scale with n? 

How does the situation change if the $C_i$ are not picked randomly, but by some adaptive 
scheme which proposes new values of C by looking at how the previous ones did, and 
guessing which change of C would likely improve the performance on the test set? 


5.11 (Overfitting the Leave-One-Out Error ••) Explain how it is possible to overfit 
the leave-one-out error. I.e., consider a learning algorithm that minimizes the leave-one-out 
error, and argue that it is possible that this algorithm will overfit. 


5.12 (Learning Theory for Differential Equations ○○○) Can you develop a statistical 
theory of estimating differential equations from data? How can one suitably restrict the 
“capacity” of differential equations? 

Note that without restrictions, already ordinary differential equations may exhibit 
behavior where the capacity is infinite, as exemplified by Rubel’s universal differential 
equation [447] 

$$3(y')^4 y'' (y'''')^2 - 4(y')^4 (y''')^2 y'''' + 6(y')^3 (y'')^2 y''' y'''' + 24(y')^2 (y'')^4 y'''' - 12(y')^3 y'' (y''')^3 - 29(y')^2 (y'')^3 (y''')^2 + 12(y'')^7 = 0. \tag{5.69}$$

Rubel proved that given any continuous function $f : \mathbb{R} \to \mathbb{R}$ and any positive continuous 
function $\varepsilon : \mathbb{R} \to \mathbb{R}^+$, there exists a $C^\infty$ solution y of (5.69) such that $|y(t) - f(t)| < \varepsilon(t)$ 
for all $t \in \mathbb{R}$. Therefore, all continuous functions are uniform limits of sequences of 
solutions of (5.69). Moreover, y can be made to agree with f at a countable number of 
distinct points $(t_i)$. Further references of interest to this problem include [61, 78, 63]. 


6 Optimization 


Overview 


This chapter provides a self-contained overview of some of the basic tools needed 
to solve the optimization problems used in kernel methods. In particular, we will 
cover topics such as minimization of functions in one variable, convex minimiza- 
tion and maximization problems, duality theory, and statistical methods to solve 
optimization problems approximately. 

The focus is noticeably different from the topics covered in works on optimiza- 
tion for Neural Networks, such as Backpropagation [588, 452, 317, 7] and its vari- 
ants. In these cases, it is necessary to deal with non-convex problems exhibiting a 
large number of local minima, whereas much of the research on Kernel Methods 
and Mathematical Programming is focused on problems with global exact solu- 
tions. These boundaries may become less clear-cut in the future, but at the present 
time, methods for the solution of problems with unique optima appear to be suffi- 
cient for our purposes. 

In Section 6.1, we explain general properties of convex sets and functions, and 
how the extreme values of such functions can be found. Next, we discuss practical 
algorithms to best minimize convex functions on unconstrained domains (Section 
6.2). In this context, we will present techniques like interval cutting methods, 
Newton’s method, gradient descent and conjugate gradient descent. Section 6.3 
then deals with constrained optimization problems, and gives characterization 
results for solutions. In this context, Lagrangians, primal and dual optimization 
problems, and the Karush-Kuhn-Tucker (KKT) conditions are introduced. These 
concepts set the stage for Section 6.4, which presents an interior point algorithm 
for the solution of constrained convex optimization problems. In a sense, the final 
section (Section 6.5) is a departure from the previous topics, since it introduces 
the notion of randomization into the optimization procedures. The basic idea is 
that unless the exact solution is required, statistical tools can speed up search 
maximization by orders of magnitude. 

For a general overview, we recommend Section 6.1, and the first parts of Sec- 
tion 6.3, which explain the basic ideas underlying constrained optimization. The 
latter section is needed to understand the calculations which lead to the dual opti- 
mization problems in Support Vector Machines (Chapters 7-9). Section 6.4 is only 
intended for readers interested in practical implementations of optimization al- 
gorithms. In particular, Chapter 10 will require some knowledge of this section. 
Finally, Section 6.5 describes novel randomization techniques, which are needed 
in the sparse greedy methods of Sections 10.2, 15.3, 16.4, and 18.4.3. Unconstrained 
optimization problems (Section 6.2) are less common in this book and will only 
be required in the gradient descent methods of Section 10.6.1, and the Gaussian 
Process implementation methods of Section 16.4. 

The present chapter is intended as an introduction to the basic concepts of 
optimization. It is relatively self-contained, and requires only basic skills in linear 
algebra and multivariate calculus. Section 6.3 is somewhat more technical, Section 
6.4 requires some additional knowledge of numerical analysis, and Section 6.5 
assumes some knowledge of probability and statistics. 


6.1 Convex Optimization 


Definition and 
Construction of 
Convex Sets and 
Functions 


In the situations considered in this book, learning (or equivalently statistical 
estimation) implies the minimization of some risk functional such as $R_{\text{emp}}[f]$ or 
$R_{\text{reg}}[f]$ (cf. Chapter 4). While minimizing an arbitrary function on a (possibly not 
even compact) set of arguments can be a difficult task, and will most likely exhibit 
many local minima, minimization of a convex objective function on a convex set 
exhibits exactly one global minimum. We now prove this property. 


Definition 6.1 (Convex Set) A set X in a vector space is called convex if for any 
$x, x' \in X$ and any $\lambda \in [0,1]$, we have 

$$\lambda x + (1 - \lambda) x' \in X. \tag{6.1}$$


Definition 6.2 (Convex Function) A function f defined on a set X (note that X need 
not be convex itself) is called convex if, for any $x, x' \in X$ and any $\lambda \in [0,1]$ such that 
$\lambda x + (1 - \lambda) x' \in X$, we have 

$$f(\lambda x + (1 - \lambda) x') \le \lambda f(x) + (1 - \lambda) f(x'). \tag{6.2}$$

A function f is called strictly convex if for $x \ne x'$ and $\lambda \in (0,1)$ (6.2) is a strict inequality. 

Figure 6.1 Left: Convex Function in two variables. Right: the corresponding convex level 
sets $\{x \mid f(x) \le c\}$, for different values of c. 


There exist several ways to define convex sets. A convenient method is to define 
them via below sets of convex functions, such as the sets for which $f(x) \le c$, for 
instance. 


Lemma 6.3 (Convex Sets as Below-Sets) Denote by $f : \mathcal{X} \to \mathbb{R}$ a convex function on 
a convex set $\mathcal{X}$. Then the set 

$$X := \{x \mid x \in \mathcal{X} \text{ and } f(x) \le c\}, \text{ for all } c \in \mathbb{R}, \tag{6.3}$$

is convex. 


Proof We must show condition (6.1). For any $x, x' \in X$, we have $f(x), f(x') \le c$. 
Moreover, since f is convex, we also have 

$$f(\lambda x + (1 - \lambda) x') \le \lambda f(x) + (1 - \lambda) f(x') \le c \quad \text{for all } \lambda \in [0,1]. \tag{6.4}$$

Hence, for all $\lambda \in [0,1]$, we have $(\lambda x + (1 - \lambda) x') \in X$, which proves the claim. 
Figure 6.1 depicts this situation graphically. ∎ 


Lemma 6.4 (Intersection of Convex Sets) Denote by $X, X' \subset \mathcal{X}$ two convex sets. Then 
$X \cap X'$ is also a convex set. 


Proof Given any $x, x' \in X \cap X'$, then for any $\lambda \in [0,1]$, the point 
$x_\lambda := \lambda x + (1 - \lambda) x'$ satisfies $x_\lambda \in X$ and $x_\lambda \in X'$, hence also $x_\lambda \in X \cap X'$. ∎ 


See also Figure 6.2. Now we have the tools to prove the central theorem of this 
section. 


Theorem 6.5 (Minima on Convex Sets) If the convex function $f : \mathcal{X} \to \mathbb{R}$ has a 
minimum on a convex set $X \subset \mathcal{X}$, then its arguments $x \in X$, for which the minimum value 
is attained, form a convex set. Moreover, if f is strictly convex, then this set will contain 
only one element. 

Figure 6.2 Left: a convex set; observe that lines with points in the set are fully contained 
inside the set. Right: the intersection of two convex sets is also a convex set. 

Figure 6.3 Note that the maximum of a convex function is obtained at the ends of the 
interval [a, b]. 

Proof Denote by c the minimum of f on X. Then the set 
$X_m := \{x \mid x \in \mathcal{X} \text{ and } f(x) \le c\}$ is clearly convex. In addition, $X_m \cap X$ is also convex, 
and $f(x) = c$ for all $x \in X_m \cap X$ (otherwise c would not be the minimum). 

If f is strictly convex, then for any $x, x' \in X$, and in particular for any 
$x, x' \in X \cap X_m$, we have (for $x \ne x'$ and all $\lambda \in (0,1)$), 

$$f(\lambda x + (1 - \lambda) x') < \lambda f(x) + (1 - \lambda) f(x') = \lambda c + (1 - \lambda) c = c. \tag{6.5}$$

This contradicts the assumption that $X_m \cap X$ contains more than one element. ∎ 


A simple application of this theorem is in constrained convex minimization. Recall 
that the notation [n], used below, is a shorthand for {1,...,n}. 


Corollary 6.6 (Constrained Convex Minimization) Given the set of convex functions 
$f, c_1, \ldots, c_n$ on the convex set $\mathcal{X}$, the problem 

$$\begin{aligned} &\underset{x}{\text{minimize}} && f(x), \\ &\text{subject to} && c_i(x) \le 0 \text{ for all } i \in [n], \end{aligned} \tag{6.6}$$

has as its solution a convex set, if a solution exists. This solution is unique if f is strictly 
convex. 


Many problems in Mathematical Programming or Support Vector Machines can 
be cast into this formulation. This means either that they all have unique solutions 
(if f is strictly convex), or that all solutions are equally good and form a convex set 
(if f is merely convex). 

We might ask what can be said about convex maximization. Let us analyze a 
simple case first: convex maximization on an interval. 

Lemma 6.7 (Convex Maximization on an Interval) Denote by f a convex function 
on $[a,b] \subset \mathbb{R}$. Then the maximum of f on [a,b] is attained at a or at b. 


Proof Any $x \in [a,b]$ can be written as $x = \frac{b-x}{b-a}\, a + \left(1 - \frac{b-x}{b-a}\right) b$, and hence 

$$f(x) \le \frac{b-x}{b-a} f(a) + \left(1 - \frac{b-x}{b-a}\right) f(b) \le \max(f(a), f(b)). \tag{6.7}$$

Therefore the maximum of f on [a,b] is obtained on one of the points a, b. ∎ 


We will next show that the problem of convex maximization on a convex set is 
typically a hard problem, in the sense that the maximum can only be found at one 
of the extreme points of the constraining set. We must first introduce the notion of 
vertices of a set. 


Definition 6.8 (Vertex of a Set) A point $x \in X$ is a vertex of X if, for all $x' \in X$ with 
$x' \ne x$, and for all $\lambda > 1$, the point $\lambda x + (1 - \lambda) x' \notin X$. 


This definition implies, for instance, that in the case of X being an $\ell_2$ ball, the 
vertices of X make up its surface. In the case of an $\ell_\infty$ ball, we have $2^n$ vertices in 
n dimensions, and for an $\ell_1$ ball, we have only 2n of them. These differences will 
guide us in the choice of admissible sets of parameters for optimization problems 
(see, e.g., Section 14.4). In particular, there exists a connection between suprema 
on sets and their convex hulls. To state this link, however, we need to define the 
latter. 


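Definition 6.9 (Convex Hull) Denote by X a set in a vector space. Then the convex hull 
co X is defined as 

$$\operatorname{co} X := \left\{ \bar{x} \;\middle|\; \bar{x} = \sum_{i=1}^{n} \alpha_i x_i \text{ where } n \in \mathbb{N},\ x_i \in X,\ \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i = 1 \right\}. \tag{6.8}$$


Theorem 6.10 (Suprema on Sets and their Convex Hulls) Denote by X a set and by 
co X its convex hull. Then for a convex function f 

$$\sup\{f(x) \mid x \in X\} = \sup\{f(x) \mid x \in \operatorname{co} X\}. \tag{6.9}$$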


Proof Recall that the below set of convex functions is convex (Lemma 6.3), and 
that the below set of f with respect to c = sup{f(x)|x € X} is by definition a 
superset of X. Moreover, due to its convexity, it is also a superset of co X. ∎ 


This theorem can be used to replace search operations over sets X by subsets 
$X' \subset X$, which are considerably smaller, if the convex hull of the latter generates 
X. In particular, the vertices of convex sets are sufficient to reconstruct the whole 
set. 


Theorem 6.11 (Vertices) A compact convex set is the convex hull of its vertices. 




Reconstructing 
Convex Sets from 
Vertices 




Figure 6.4 A convex function on a convex polyhedral set. Note that the minimum of this 
function is unique, and that the maximum can be found at one of the vertices of the 
constraining domain. 

The proof is slightly technical, and not central to the understanding of kernel 
methods. See Rockafellar [435, Chapter 18] for details, along with further theorems 
on convex functions. We now proceed to the second key theorem in this section. 


Theorem 6.12 (Maxima of Convex Functions on Convex Compact Sets) Denote 
by X a compact convex set in $\mathcal{X}$, by |X the vertices of X, and by f a convex function 
on X. Then 

$$\sup\{f(x) \mid x \in X\} = \sup\{f(x) \mid x \in |X\}. \tag{6.10}$$


Proof Application of Theorem 6.10 and Theorem 6.11 proves the claim, since 
under the assumptions made on X, we have X = co(|X). Figure 6.4 depicts the 
situation graphically. ∎ 
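
As a small numerical illustration of Theorem 6.12, the sketch below maximizes a convex function over the $\ell_\infty$ unit ball in three dimensions by checking only its $2^3$ vertices, and compares the result with random points of the ball. The test function `f`, the helper `max_over_vertices`, and the sample size are choices made here for illustration only.

```python
import itertools
import random


def f(x):
    # A convex function: squared Euclidean distance to a fixed point.
    return (x[0] - 0.3) ** 2 + (x[1] + 0.2) ** 2 + (x[2] - 0.1) ** 2


def max_over_vertices(func, n):
    """Maximum of func over the 2^n vertices of the box [-1, 1]^n."""
    return max(func(v) for v in itertools.product((-1.0, 1.0), repeat=n))


vertex_max = max_over_vertices(f, 3)

# Random points of the box never exceed the best vertex value.
rng = random.Random(0)
samples = ([rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(10000))
sampled_max = max(f(x) for x in samples)
print(vertex_max, sampled_max)
```

The number of vertices grows as $2^n$ for this ball, so the enumeration is only practical in low dimensions; this is one way to see why convex maximization is hard in general.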


6.2 Unconstrained Problems 


Continuous 
Differentiable 
Functions 


After the characterization and uniqueness results (Theorem 6.5, Corollary 6.6, and 
Lemma 6.7) of the previous section, we will now study numerical techniques to 
obtain minima (or maxima) of convex optimization problems. While the choice 
of algorithms is motivated by applicability to kernel methods, the presentation 
here is not problem specific. For details on implementation, and descriptions of 
applications to learning problems, see Chapter 10. 


6.2.1 Functions of One Variable 


We begin with the easiest case, in which f depends on only one variable. Some of 
the concepts explained here, such as the interval cutting algorithm and Newton’s 
method, can be extended to the multivariate setting (see Problem 6.5). For the sake 
of simplicity, however, we limit ourselves to the univariate case. 

Assume we want to minimize $f : \mathbb{R} \to \mathbb{R}$ on the interval $[a,b] \subset \mathbb{R}$. If we cannot 
make any further assumptions regarding f, then this problem, as simple as it may 
seem, cannot be solved numerically. 

If f is differentiable, the problem can be reduced to finding f'(x) = 0 (see 
Problem 6.4 for the general case). If in addition to the previous assumptions, f is 
convex, then f' is nondecreasing, and we can find a fast, simple algorithm (Algorithm 
6.1) to solve our problem (see Figure 6.5). 

Figure 6.5 Interval Cutting Algorithm. The selection of points is ordered according to the 
numbers beneath (points 1 and 2 are the initial endpoints of the interval). 


Algorithm 6.1 Interval Cutting 

Require: a, b, Precision ε 
Set A = a, B = b 
repeat 
    if f'((A + B)/2) > 0 then 
        B = (A + B)/2 
    else 
        A = (A + B)/2 
    end if 
until (B − A) min(|f'(A)|, |f'(B)|) ≤ ε 
Output: x = (A + B)/2 

This technique works by halving the size of the interval that contains the min- 
imum x* of f, since it is always guaranteed by the selection criteria for B and A 
that x* € [A, B]. We use the following Taylor series expansion to determine the 
stopping criterion. 


Theorem 6.13 (Taylor Series) Denote by $f : \mathbb{R} \to \mathbb{R}$ a function that is d times 
differentiable. Then for any $x, x' \in \mathbb{R}$, there exists a $\xi$ with $|\xi| \le |x - x'|$, such that 

$$f(x') = \sum_{i=0}^{d-1} \frac{1}{i!} f^{(i)}(x) (x' - x)^i + \frac{\xi^d}{d!} f^{(d)}(x + \xi). \tag{6.11}$$

Now we may apply (6.11) to the stopping criterion of Algorithm 6.1. We denote 
by $x^*$ the minimizer of f. Expanding f around $x^*$, we obtain for some 
$\xi_A \in [A - x^*, 0]$ that $f(A) = f(x^*) + \xi_A f'(x^* + \xi_A)$, and therefore, 

$$f(A) - f(x^*) = \xi_A f'(x^* + \xi_A) \le (B - A) |f'(A)|.$$

Taking the minimum over {A, B} shows that Algorithm 6.1 stops once f is ε-close 
to its minimal value. The convergence of the algorithm is linear with constant 0.5, 
since the intervals [A, B] for possible $x^*$ are halved at each iteration. 
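
A minimal Python sketch of the interval cutting idea of Algorithm 6.1 follows. It assumes that the caller supplies the derivative `df` of a convex function whose minimizer lies in [a, b]; the function name, the default precision, and the iteration cap are illustrative choices. The sketch returns the endpoint with the smaller derivative magnitude, since that is the point to which the stopping criterion analysed above applies.

```python
def interval_cutting(df, a, b, eps=1e-8, max_iter=200):
    """Bisection on the sign of f' for a convex, differentiable f.

    df: derivative of f (assumed nondecreasing on [a, b]).
    Stops once (B - A) * min(|f'(A)|, |f'(B)|) <= eps.
    """
    A, B = a, b
    for _ in range(max_iter):
        if (B - A) * min(abs(df(A)), abs(df(B))) <= eps:
            break
        mid = 0.5 * (A + B)
        if df(mid) > 0:
            B = mid   # the minimizer lies to the left of mid
        else:
            A = mid   # the minimizer lies to the right of mid (or at mid)
    # Return the endpoint certified by the stopping criterion.
    return A if abs(df(A)) <= abs(df(B)) else B


# Example (assumed): f(x) = x^2 - 2x has f'(x) = 2x - 2 and minimizer x* = 1.
x_min = interval_cutting(lambda x: 2.0 * x - 2.0, -3.0, 4.0)
print(x_min)
```

Each pass halves the bracket, which is the linear convergence with constant 0.5 mentioned above.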
Algorithm 6.2 Newton’s Method 

Require: x_0, Precision ε 
Set x = x_0 
repeat 
    x = x − f'(x)/f''(x) 
until |f'(x)| ≤ ε 
Output: x 

In constructing the interval cutting algorithm, we in fact wasted most of the 
information obtained in evaluating f’ at each point, by only making use of the 
sign of f'. In particular, we could fit a parabola to f and thereby obtain a method 
that converges more rapidly. If we are only allowed to use f and f’, this leads to 
the Method of False Position (see [334] or Problem 6.3). 

Moreover, if we may compute the second derivative as well, we can use (6.11) to 
obtain a quadratic approximation of f and use the latter to find the minimum of f. 
This is commonly referred to as Newton's method (see Section 16.4.1 for a practical 
application of the latter to classification problems). We expand f(x) around $x_0$: 

$$f(x) \approx f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2} f''(x_0). \tag{6.12}$$

Minimization of the expansion (6.12) yields 

$$x = x_0 - \frac{f'(x_0)}{f''(x_0)}. \tag{6.13}$$

Hence, we hope that if the approximation (6.12) is good, we will obtain an algo- 
rithm with fast convergence (Algorithm 6.2). Let us analyze the situation in more 
detail. For convenience, we state the result in terms of g := f’, since finding a zero 
of g is equivalent to finding a minimum of f. 


Theorem 6.14 (Convergence of Newton Method) Let $g : \mathbb{R} \to \mathbb{R}$ be a twice 
continuously differentiable function, and denote by $x^* \in \mathbb{R}$ a point with $g'(x^*) \ne 0$ and 
$g(x^*) = 0$. Then, provided $x_0$ is sufficiently close to $x^*$, the sequence generated by (6.13) 
will converge to $x^*$ at least quadratically. 


Proof For convenience, denote by $x_n$ the value of x at the nth iteration. As before, 
we apply Theorem 6.13. We now expand $g(x^*)$ around $x_n$. For some $\xi \in [0, x^* - x_n]$, 
we have 

$$g(x_n) = g(x_n) - g(x^*) = g(x_n) - \left[ g(x_n) + g'(x_n)(x^* - x_n) + \frac{\xi^2}{2} g''(x_n + \xi) \right], \tag{6.14}$$

and therefore by substituting (6.14) into (6.13), 

$$x_{n+1} - x^* = x_n - x^* - \frac{g(x_n)}{g'(x_n)} = \frac{\xi^2}{2} \frac{g''(x_n + \xi)}{g'(x_n)}. \tag{6.15}$$

Since by construction $|\xi| \le |x_n - x^*|$, we obtain a quadratically convergent 
algorithm in $|x_n - x^*|$, provided that $\left| (x_n - x^*) \frac{g''(x_n + \xi)}{2 g'(x_n)} \right| < 1$. ∎ 


In other words, if the Newton method converges, it converges more rapidly than 
interval cutting or similar methods. We cannot guarantee beforehand that we are 
really in the region of convergence of the algorithm. In practice, if we apply 
the Newton method and find that it converges, we know that the solution has 
converged to the minimizer of f. For more information on optimization algorithms 
for unconstrained problems see [173, 530, 334, 15, 159, 45]. 
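
The update (6.13) and the stopping rule of Algorithm 6.2 can be sketched in a few lines of Python. The function name `newton_minimize`, the iteration cap, and the example function are assumptions made for illustration; the sketch raises an error instead of looping forever when the starting point lies outside the region of convergence.

```python
import math


def newton_minimize(df, d2f, x0, eps=1e-10, max_iter=50):
    """Newton iteration x <- x - f'(x) / f''(x) until |f'(x)| <= eps.

    df, d2f: first and second derivative of f (supplied by the caller).
    """
    x = x0
    for _ in range(max_iter):
        if abs(df(x)) <= eps:
            return x
        x = x - df(x) / d2f(x)
    raise RuntimeError("Newton's method did not converge from this start")


# Example (assumed): f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and
# f''(x) = exp(x); the minimizer is x* = ln 2.
x_star = newton_minimize(lambda x: math.exp(x) - 2.0, math.exp, 0.0)
print(x_star, math.log(2.0))
```

On this example only a handful of iterations are needed, whereas interval cutting gains one binary digit per step.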

In some cases we will not know an upper bound on the size of the interval to be 
analyzed for the presence of minima. In this situation we may, for instance, start 
with an initial guess of an interval, and if no minimum can be found strictly inside 
the interval, enlarge it, say by doubling its size. See [334] for more information on 
this matter. Let us now proceed to a technique which is quite popular (albeit not 
always preferable) in machine learning. 


6.2.2 Functions of Several Variables: Gradient Descent 


Gradient descent is one of the simplest optimization techniques to implement for 
minimizing functions of the form $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ may be $\mathbb{R}^N$, or indeed any 
set on which a gradient may be defined and evaluated. In order to avoid further 
complications we assume that the gradient f'(x) exists and that we are able to 
compute it. 

The basic idea is as follows: given a location $x_n$ at iteration n, compute the 
gradient $g_n := f'(x_n)$, and update 

$$x_{n+1} = x_n - \gamma g_n \tag{6.16}$$

such that the decrease in f is maximal over all $\gamma > 0$. For the final step, one of the 
algorithms from Section 6.2.1 can be used. It is straightforward to show that $f(x_n)$ 
is a monotonically decreasing sequence, since at each step the line search updates $x_{n+1}$ 
in such a way that $f(x_{n+1}) < f(x_n)$. Such a value of $\gamma$ must exist, since (again by 
Theorem 6.13) we may expand $f(x_n - \gamma g_n)$ in terms of $\gamma$ around $x_n$ to obtain¹ 

$$f(x_n - \gamma g_n) = f(x_n) - \gamma \|g_n\|^2 + O(\gamma^2). \tag{6.17}$$

As usual $\|\cdot\|$ is the Euclidean norm. For small $\gamma$ the linear contribution in the 
Taylor expansion will be dominant, hence for some $\gamma > 0$ we have 
$f(x_n - \gamma g_n) < f(x_n)$. It can be shown [334] that after a (possibly infinite) number of steps, 
gradient descent (see Algorithm 6.3) will converge. 

In spite of this, the performance of gradient descent is far from optimal. De- 
pending on the shape of the landscape of values of f, gradient descent may take 
a long time to converge. Figure 6.6 shows two examples of possible convergence 
behavior of the gradient descent algorithm. 
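
The following Python sketch puts the pieces together: a gradient step (6.16) whose length is chosen by a line search along the negative gradient. For the line search it uses bisection on the derivative of $\gamma \mapsto f(x_n - \gamma g_n)$, in the spirit of Section 6.2.1, with the bracket enlarged by doubling. It assumes f is convex and differentiable and that the caller supplies the gradient; all names, tolerances, and the example function are illustrative.

```python
import math


def line_search(grad, x, g):
    """Minimize phi(gamma) = f(x - gamma * g) over gamma >= 0 (f convex)."""
    def dphi(gamma):
        gx = grad([xi - gamma * gi for xi, gi in zip(x, g)])
        return -sum(a * b for a, b in zip(gx, g))   # phi'(gamma)

    hi = 1.0
    while dphi(hi) < 0 and hi < 1e6:   # enlarge the bracket by doubling
        hi *= 2.0
    lo = 0.0
    for _ in range(100):               # bisection on the sign of phi'
        mid = 0.5 * (lo + hi)
        if dphi(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def gradient_descent(grad, x0, eps=1e-6, max_iter=10000):
    """Steepest descent with line search; stops when ||f'(x)|| <= eps."""
    x = list(x0)
    for n in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            return x, n
        gamma = line_search(grad, x, g)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return x, max_iter


# Example (assumed): f(x) = (x_1^2 + 10 x_2^2) / 2, a narrow valley.
x_opt, n_steps = gradient_descent(lambda x: [x[0], 10.0 * x[1]], [10.0, 1.0])
print(x_opt, n_steps)
```

The example function has elongated elliptic level sets, so the iterates zig-zag and many steps are required even in two dimensions.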


1. To see that Theorem 6.13 applies in (6.17), note that $f(x_n - \gamma g_n)$ is a mapping 
$\mathbb{R} \to \mathbb{R}$ when viewed as a function of $\gamma$. 

Algorithm 6.3 Gradient Descent 

Require: x_0, Precision ε 
n = 0 
repeat 
    Compute g = f'(x_n) 
    Perform line search on f(x_n − γg) for optimal γ. 
    x_{n+1} = x_n − γg 
    n = n + 1 
until ||f'(x_n)|| ≤ ε 
Output: x_n 


Figure 6.6 Left: Gradient descent takes a long time to converge, since the landscape of 
values of f forms a long and narrow valley, causing the algorithm to zig-zag along the 
walls of the valley. Right: due to the homogeneous structure of the minimum, the algorithm 
converges after very few iterations. Note that in both cases, the next direction of descent is 
orthogonal to the previous one, since line search provides the optimal step length. 


6.2.3 Convergence Properties of Gradient Descent 


Let us analyze the convergence properties of Algorithm 6.3 in more detail. To keep 
matters simple, we assume that f is a quadratic function, i.e. 

$$f(x) = \frac{1}{2} (x - x^*)^\top K (x - x^*) + c_0, \tag{6.18}$$

where K is a positive definite symmetric matrix (cf. Definition 2.4) and $c_0$ is 
constant.² This is clearly a convex function with minimum at $x^*$, and $f(x^*) = c_0$. 
The gradient of f is given by 

$$g := f'(x) = K(x - x^*). \tag{6.19}$$

To find the update of the steepest descent we have to minimize 

$$f(x - \gamma g) = \frac{1}{2} (x - \gamma g - x^*)^\top K (x - \gamma g - x^*) + c_0 = \frac{1}{2} \gamma^2 g^\top K g - \gamma g^\top g + f(x). \tag{6.20}$$

2. Note that we may rewrite (up to a constant) any convex quadratic function 
$f(x) = x^\top K x + c^\top x + d$ in the form (6.18), simply by expanding f around its minimizer $x^*$. 

By minimizing (6.20) for $\gamma$, the update of steepest descent is given explicitly by 

$$x_{n+1} = x_n - \frac{g^\top g}{g^\top K g}\, g. \tag{6.21}$$

Substituting (6.21) into (6.18) and subtracting the terms $f(x_n)$ and $f(x_{n+1})$ yields 
the following improvement after an update step 

$$f(x_n) - f(x_{n+1}) = (x_n - x^*)^\top K \frac{g^\top g}{g^\top K g}\, g - \frac{1}{2} \left( \frac{g^\top g}{g^\top K g} \right)^2 g^\top K g = \frac{1}{2} \frac{(g^\top g)^2}{g^\top K g} = \left[ f(x_n) - f(x^*) \right] \frac{(g^\top g)^2}{(g^\top K g)(g^\top K^{-1} g)}. \tag{6.22}$$

Thus the relative improvement per iteration depends on the value of 
$t(g) := \frac{(g^\top g)^2}{(g^\top K g)(g^\top K^{-1} g)}$. In order to give performance guarantees we have to find a lower 
bound for t(g). To this end we introduce the condition of a matrix. 


Definition 6.15 (Condition of a Matrix) Denote by K a matrix and by $\lambda_{\max}$ and $\lambda_{\min}$ 
its largest and smallest singular values (or eigenvalues if they exist) respectively. The 
condition of a matrix is defined as 

$$\operatorname{cond} K := \frac{\lambda_{\max}}{\lambda_{\min}}. \tag{6.23}$$

Clearly, as cond K decreases, different directions are treated in a more homogeneous 
manner by $x^\top K x$. In particular, note that smaller cond K correspond to less 
elliptic contours in Figure 6.6. Kantorovich proved the following inequality which 
allows us to connect the condition number with the convergence behavior of 
gradient descent algorithms. 


Theorem 6.16 (Kantorovich Inequality [278]) Denote by $K \in \mathbb{R}^{m \times m}$ (typically the 
kernel matrix) a strictly positive definite symmetric matrix with largest and smallest 
eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$. Then the following inequality holds for any $g \in \mathbb{R}^m$: 

$$\frac{(g^\top g)^2}{(g^\top K g)(g^\top K^{-1} g)} \ge \frac{4 \lambda_{\min} \lambda_{\max}}{(\lambda_{\min} + \lambda_{\max})^2} \ge \frac{1}{\operatorname{cond} K}. \tag{6.24}$$


We typically denote by g the gradient of f. The second inequality follows immedi- 
ately from Definition 6.15; the proof of the first inequality is more technical, and is 
not essential to the understanding of the situation. See Problem 6.7 and [278, 334] 
for more detail. 

A brief calculation gives us the correct order of magnitude. Note that for any 
x, the quadratic term x'Kx is bounded from above by Amaxl|x||?, and likewise 
xT Ktx < Az ||x||?. Hence we bound the relative improvement t(g) (as defined 
below (6.22)) by 1/(cond K) which is almost as good as the second term in (6.24) 
(the latter can be up to a factor of 4 better for Amin K Amax): 

This means that gradient descent methods perform poorly if some of the eigenvalues of $K$
are very small in comparison with the largest eigenvalue, as is usually the case with
matrices generated by positive definite kernels (and as sometimes desired for learning
theoretical reasons); see Chapter 4 for details. This is one of the reasons why many
gradient descent algorithms for training Support Vector Machines, such as the Kernel
AdaTron [183, 12] or AdaLine [185], exhibit poor convergence. Section 10.6.1 deals with
these issues, and sets up the gradient descent directions both in the Reproducing Kernel
Hilbert Space $\mathcal{H}$ and in coefficient space $\mathbb{R}^m$.
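The chain of inequalities in (6.24) can be checked numerically. The following is a minimal sketch, assuming a diagonal $K$ so that $K^{-1}$ is trivial to apply; the function names `kantorovich_ratio` and `kantorovich_bounds` and the example eigenvalues are illustrative choices, not part of the text.

```python
def kantorovich_ratio(eigs, g):
    """Left-hand side of (6.24) for the diagonal matrix K = diag(eigs)."""
    gg = sum(gi * gi for gi in g)                          # g^T g
    gKg = sum(l * gi * gi for l, gi in zip(eigs, g))       # g^T K g
    gKinvg = sum(gi * gi / l for l, gi in zip(eigs, g))    # g^T K^{-1} g
    return gg * gg / (gKg * gKinvg)


def kantorovich_bounds(eigs):
    """Middle and right-hand terms of (6.24)."""
    lmin, lmax = min(eigs), max(eigs)
    middle = 4.0 * lmin * lmax / (lmin + lmax) ** 2
    lower = lmin / lmax                                    # 1 / cond K
    return middle, lower


# Hypothetical spectrum with cond K = 100.
eigs = [100.0, 10.0, 1.0]
middle, lower = kantorovich_bounds(eigs)

# Equal weight on the two extreme eigenvectors attains the middle bound.
worst = kantorovich_ratio(eigs, [1.0, 0.0, 1.0])
# A gradient along a single eigenvector gives the best possible ratio.
best = kantorovich_ratio(eigs, [1.0, 0.0, 0.0])
```

For this spectrum the middle term is roughly four times the crude bound $1/\operatorname{cond} K$, which matches the remark that the former can be up to a factor of 4 better when $\lambda_{\min} \ll \lambda_{\max}$.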


6.2.4 Functions of Several Variables: Conjugate Gradient Descent 


Let us now look at methods that are better suited to minimizing convex functions.
Again, we start with quadratic forms. The key problem with gradient descent is
that the ratio of the largest to the smallest eigenvalue can be very large,
which leads to slow convergence. Hence, one possible technique is to rescale X by
some matrix $M$ such that the condition of $K \in \mathbb{R}^{m \times m}$ in this rescaled
space, which is to say the condition of $M^\top K M$, is much closer to 1 (in numerical
analysis this is often referred to as preconditioning [247, 423, 530]). In addition, we
would like to focus first on the eigenvectors of $K$ with the largest eigenvalues.

A key tool is the concept of conjugate directions. The basic idea is that rather than
using the metric of the normal dot product $x^\top x' = x^\top \mathbf{1} x'$ ($\mathbf{1}$ is
the unit matrix) we use the metric imposed by $K$, i.e. $x^\top K x'$, to guide our
algorithm, and we introduce an equivalent notion of orthogonality with respect to the new
metric.


Definition 6.17 (Conjugate Directions) Given a symmetric matrix $K \in \mathbb{R}^{m \times m}$,
any two vectors $v, v' \in \mathbb{R}^m$ are called K-orthogonal if $v^\top K v' = 0$.


Likewise, we can introduce notions of a basis and of linear independence with 
respect to K. The following theorem establishes the necessary identities. 


Theorem 6.18 (Orthogonal Decompositions in K) Denote by $K \in \mathbb{R}^{m \times m}$ a strictly
positive definite symmetric matrix and by $v_1, \ldots, v_m$ a set of mutually K-orthogonal and
nonzero vectors. Then the following properties hold:

(i) The vectors $v_1, \ldots, v_m$ form a basis.

(ii) Any $x \in \mathbb{R}^m$ can be expanded in terms of $v_i$ by

$$x = \sum_{i=1}^m v_i \frac{v_i^\top K x}{v_i^\top K v_i}. \tag{6.25}$$

In particular, for any $y = Kx$, we can find $x$ by

$$x = \sum_{i=1}^m v_i \frac{v_i^\top y}{v_i^\top K v_i}. \tag{6.26}$$


Proof (i) Since we have $m$ vectors in $\mathbb{R}^m$, all we have to show is that the vectors
$v_i$ are linearly independent. Assume that there exist some $\alpha_i \in \mathbb{R}$ such that
$\sum_{i=1}^m \alpha_i v_i = 0$. Then due to K-orthogonality, we have

$$0 = v_j^\top K \left( \sum_{i=1}^m \alpha_i v_i \right) = \sum_{i=1}^m \alpha_i v_j^\top K v_i = \alpha_j v_j^\top K v_j \text{ for all } j. \tag{6.27}$$

Hence $\alpha_j = 0$ for all $j$. This means that all $v_j$ are linearly independent.


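(ii) The vectors $\{v_1, \ldots, v_m\}$ form a basis. Therefore we may expand any
$x \in \mathbb{R}^m$ as a linear combination of $v_i$, i.e. $x = \sum_{i=1}^m \alpha_i v_i$.
Consequently we can expand $v_j^\top K x$ in terms of $v_j^\top K v_i$, and we obtain

$$v_j^\top K x = v_j^\top K \left( \sum_{i=1}^m \alpha_i v_i \right) = \alpha_j v_j^\top K v_j. \tag{6.28}$$

Solving for $\alpha_j$ proves the claim.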


(iii) Let $y = Kx$. Since the vectors $v_i$ form a basis, we can expand $x$ in terms of
$\alpha_i$. Substituting this definition into (6.28) proves (6.26). ∎


The practical consequence of this theorem is that, provided we know a set of K-orthogonal
vectors $v_i$, we can solve the linear equation $y = Kx$ via (6.26). Furthermore, we can
also use it to minimize quadratic functions of the form
$f(x) = \frac{1}{2} x^\top K x - c^\top x$. The following theorem tells us how.
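To make (6.26) concrete, here is a small sketch that solves $y = Kx$ without inverting $K$. It assumes the K-orthogonal vectors are obtained by Gram-Schmidt orthogonalization in the inner product $u^\top K v$, starting from the standard basis; that construction, the helper names, and the example matrix are illustrative assumptions rather than something the text prescribes.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(K, v):
    return [dot(row, v) for row in K]


def k_orthogonalize(K, vectors):
    """Gram-Schmidt with respect to the inner product u^T K v."""
    basis = []
    for u in vectors:
        Ku = matvec(K, u)
        w = list(u)
        for v in basis:
            # Remove the K-projection of u onto the earlier direction v.
            coef = dot(v, Ku) / dot(v, matvec(K, v))
            w = [wi - coef * vi for wi, vi in zip(w, v)]
        basis.append(w)
    return basis


def solve_via_conjugate_basis(K, basis, y):
    """x = sum_i v_i (v_i^T y) / (v_i^T K v_i), cf. (6.26)."""
    x = [0.0] * len(y)
    for v in basis:
        coef = dot(v, y) / dot(v, matvec(K, v))
        x = [xi + coef * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical strictly positive definite symmetric matrix.
K = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
standard_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
basis = k_orthogonalize(K, standard_basis)

x_true = [1.0, -2.0, 3.0]
y = matvec(K, x_true)
x_rec = solve_via_conjugate_basis(K, basis, y)
```

Note that (6.26) only needs products of the form $v_i^\top y$ and $v_i^\top K v_i$, so each term in the sum can be computed independently of the others.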


Theorem 6.19 (Deflation Method) Denote by $v_1, \ldots, v_m$ a set of mutually K-orthogonal
vectors for a strictly positive definite symmetric matrix $K \in \mathbb{R}^{m \times m}$. Then for
any $x_0 \in \mathbb{R}^m$ the following method finds $x_i$ that minimize
$f(x) = \frac{1}{2} x^\top K x - c^\top x$ in the linear manifold
$X_i := x_0 + \operatorname{span}\{v_1, \ldots, v_i\}$:

$$x_i := x_{i-1} - v_i \frac{g_{i-1}^\top v_i}{v_i^\top K v_i} \text{ where } g_{i-1} = f'(x_{i-1}) \text{ for all } i > 0. \tag{6.29}$$

Proof We use induction. For $i = 0$ the statement is trivial, since the linear manifold
consists of only one point.

Assume that the statement holds for $i - 1$. Since $f$ is convex, we only need prove
that the gradient of $f$ at $x_i$ is orthogonal to $\operatorname{span}\{v_1, \ldots, v_i\}$. In
that case no further improvement can be gained on the linear manifold $X_i$. It suffices to
show that for all $j \le i$,

$$0 = v_j^\top g_i. \tag{6.30}$$

Additionally, we may expand $x_i$ according to (6.29) to obtain

$$v_j^\top g_i = v_j^\top \left[ K x_{i-1} - c - K v_i \frac{g_{i-1}^\top v_i}{v_i^\top K v_i} \right] = v_j^\top g_{i-1} - \left( g_{i-1}^\top v_i \right) \frac{v_j^\top K v_i}{v_i^\top K v_i}. \tag{6.31}$$

For $j = i$ both terms cancel out. For $j < i$ both terms vanish, the first due to the
induction assumption and the second due to K-orthogonality. Since the vectors $v_j$ form a
basis, $X_m = \mathbb{R}^m$, and hence $x_m$ is a minimizer of $f$.


In a nutshell, Theorem 6.19 already contains the Conjugate Gradient descent algorithm:
in each step we perform gradient descent with respect to one of the K-orthogonal vectors
$v_i$, which means that after $m$ steps we will reach the minimum. We still lack a method to
obtain such a K-orthogonal basis of vectors $v_i$. It turns out that we can get the latter
directly from the gradients $g_i$. Algorithm 6.4 describes the procedure.

Algorithm 6.4 Conjugate Gradient Descent

Require: $x_0$
Set $i = 0$
$g_0 = f'(x_0)$
$v_0 = g_0$
repeat
    $x_{i+1} = x_i + \alpha_i v_i$ where $\alpha_i = -\frac{g_i^\top v_i}{v_i^\top K v_i}$
    $g_{i+1} = f'(x_{i+1})$
    $v_{i+1} = -g_{i+1} + \beta_i v_i$ where $\beta_i = \frac{g_{i+1}^\top K v_i}{v_i^\top K v_i}$
    $i = i + 1$
until $g_i = 0$
Output: $x_i$

All we have to do is prove that Algorithm 6.4 actually does what it is required 
to do, namely generate a K-orthogonal set of vectors v;, and perform deflation in 
the latter. To achieve this, the v; are obtained by an orthogonalization procedure 
akin to Gram-Schmidt orthogonalization. 


Theorem 6.20 (Conjugate Gradient) Assume we are given a quadratic convex function
$f(x) = \frac{1}{2} x^\top K x - c^\top x$, to which we apply conjugate gradient descent for
minimization purposes. Then Algorithm 6.4 is a deflation method, and unless $g_i = 0$, we
have for every $0 \le i \le m$,

(i) $\operatorname{span}\{g_0, \ldots, g_i\} = \operatorname{span}\{v_0, \ldots, v_i\} = \operatorname{span}\{g_0, K g_0, \ldots, K^i g_0\}$.

(ii) The vectors $v_i$ are K-orthogonal.

(iii) The equations in Algorithm 6.4 for $\alpha_i$ and $\beta_i$ can be replaced by
$\alpha_i = \frac{g_i^\top g_i}{v_i^\top K v_i}$ and $\beta_i = \frac{g_{i+1}^\top g_{i+1}}{g_i^\top g_i}$.

(iv) After $i$ steps, $x_i$ is the solution in the manifold
$x_0 + \operatorname{span}\{g_0, K g_0, \ldots, K^{i-1} g_0\}$.


Proof (i) and (ii) We use induction. For $i = 0$ the statements trivially hold since
$v_0 = g_0$. For $i$ note that by construction (see Algorithm 6.4)
$g_{i+1} = K x_{i+1} - c = g_i + \alpha_i K v_i$, hence
$\operatorname{span}\{g_0, \ldots, g_{i+1}\} = \operatorname{span}\{g_0, K g_0, \ldots, K^{i+1} g_0\}$.
Since $v_{i+1} = -g_{i+1} + \beta_i v_i$ the same statement holds for
$\operatorname{span}\{v_0, \ldots, v_{i+1}\}$. Moreover, the vectors $g_i$ are linearly
independent or 0 due to Theorem 6.19.

Finally $v_j^\top K v_{i+1} = -v_j^\top K g_{i+1} + \beta_i v_j^\top K v_i = 0$, since for $j = i$
both terms cancel out, and for $j < i$ both terms individually vanish (due to Theorem 6.19
and (i)).

(iii) We have $-g_i^\top v_i = g_i^\top g_i - \beta_{i-1} g_i^\top v_{i-1} = g_i^\top g_i$, since the
second term vanishes due to Theorem 6.19. This proves the result for $\alpha_i$.
Table 6.1 Non-quadratic modifications of conjugate gradient descent.

Generic Method          Compute the Hessian $K_i := f''(x_i)$ and update $\alpha_i$, $\beta_i$ with
                        $\alpha_i = -\frac{g_i^\top v_i}{v_i^\top K_i v_i}$,
                        $\beta_i = \frac{g_{i+1}^\top K_i v_i}{v_i^\top K_i v_i}$.
                        This requires calculation of the Hessian at each iteration.

Fletcher-Reeves [173]   Find $\alpha_i$ via a line search and use Theorem 6.20 (iii) for $\beta_i$:
                        $\alpha_i = \operatorname{argmin}_\alpha f(x_i + \alpha v_i)$,
                        $\beta_i = \frac{g_{i+1}^\top g_{i+1}}{g_i^\top g_i}$.

Polak-Ribiere [414]     Find $\alpha_i$ via a line search:
                        $\alpha_i = \operatorname{argmin}_\alpha f(x_i + \alpha v_i)$,
                        $\beta_i = \frac{(g_{i+1} - g_i)^\top g_{i+1}}{g_i^\top g_i}$.
                        Experimentally, Polak-Ribiere tends to be better than
                        Fletcher-Reeves.


For $\beta_i$ note that
$g_{i+1}^\top K v_i = \alpha_i^{-1} g_{i+1}^\top (g_{i+1} - g_i) = \alpha_i^{-1} g_{i+1}^\top g_{i+1}$.
Substitution of the value of $\alpha_i$ proves the claim.

(iv) Again, we use induction. At step $i = 1$ we compute the solution within the
space spanned by $g_0$. ∎


We conclude this section with some remarks on the optimality of conjugate gradi- 
ent descent algorithms, and how they can be extended to arbitrary convex func- 
tions. 

Due to Theorems 6.19 and 6.20, we can see that after $i$ iterations, the conjugate
gradient descent algorithm finds a solution on the linear manifold
$x_0 + \operatorname{span}\{g_0, K g_0, \ldots, K^{i-1} g_0\}$. This means that the solutions will
be mostly aligned with the eigenvectors of $K$ that have the largest eigenvalues, since
after multiple applications of $K$ to an arbitrary vector $g_0$, these eigenvectors
dominate. Nonetheless, the algorithm here is significantly cheaper than computing the
eigenvalues of $K$, and subsequently minimizing $f$ in the subspace corresponding to the
largest eigenvalues. For more detail see [334].

In the case of general convex functions, the assumptions of Theorem 6.20 are
no longer satisfied. In spite of this, conjugate gradient descent has proven to
be effective even in these situations. Additionally, we have to account for some
modifications. Basically, the update rules for $g_i$ and $v_i$ remain unchanged but the
parameters $\alpha_i$ and $\beta_i$ are computed differently. Table 6.1 gives an overview of
different methods. See [173, 334, 530, 414] for details.
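As an illustration of the Polak-Ribiere row of Table 6.1, the following sketch minimizes a small strictly convex, non-quadratic function. Several choices here are assumptions and not taken from the text: the line search is a plain bisection on the directional derivative, the first direction is $v_0 = -g_0$ so that the step size is positive (Algorithm 6.4 uses $v_0 = g_0$ with a negative $\alpha_0$, which yields the same iterate), a restart safeguard resets $v$ whenever it is not a descent direction, and the test function is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def line_search(dphi, max_expand=60, iters=80):
    """Bisection on phi'(alpha) = 0 over alpha > 0, assuming phi convex and phi'(0) < 0."""
    hi = 1.0
    for _ in range(max_expand):            # grow the bracket until the slope turns positive
        if dphi(hi) > 0.0:
            break
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dphi(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def polak_ribiere(grad, x0, tol=1e-8, max_iter=200):
    """Nonlinear conjugate gradient descent with the Polak-Ribiere beta."""
    x = list(x0)
    g = grad(x)
    v = [-gi for gi in g]                  # assumption: start along the negative gradient
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:
            break
        if dot(g, v) >= 0.0:               # safeguard: restart if v is not a descent direction
            v = [-gi for gi in g]
        alpha = line_search(
            lambda a: dot(grad([xi + a * vi for xi, vi in zip(x, v)]), v))
        x = [xi + alpha * vi for xi, vi in zip(x, v)]
        g_new = grad(x)
        beta = dot([gn - go for gn, go in zip(g_new, g)], g_new) / dot(g, g)
        v = [-gn + beta * vi for gn, vi in zip(g_new, v)]
        g = g_new
    return x


def grad_example(x):
    """Gradient of the hypothetical f(x) = 0.5 x^T K x - c^T x + 0.25 sum_i x_i^4."""
    K = [[4.0, 1.0], [1.0, 3.0]]
    c = [1.0, 2.0]
    return [dot(row, x) - ci + xi ** 3 for row, ci, xi in zip(K, c, x)]


x_star = polak_ribiere(grad_example, [0.0, 0.0])
```

Swapping in the Fletcher-Reeves rule only changes the line computing `beta` to `dot(g_new, g_new) / dot(g, g)`.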


6.2.5 Predictor Corrector Methods 


As we go to higher order Taylor expansions of the function $f$ to be minimized
(or set to zero), the corresponding numerical methods become increasingly complicated
to implement, and require an ever increasing number of parameters to
be estimated or computed. For instance, a quadratic expansion of a multivariate
function $f : \mathbb{R}^m \to \mathbb{R}$ requires $m \times m$ terms for the quadratic part (the
Hessian), whereas the linear part (the gradient) can be obtained by computing $m$ terms.
Since the quadratic expansion is only an approximation for most non-quadratic 
functions, this is wasteful (for interior point programs, see Section 6.4). We might 
instead be able to achieve roughly the same goal without computing the quadratic 
term explicitly, or more generally, obtain the performance of higher order methods 
without actually implementing them. 

This can in fact be achieved using predictor-corrector methods. These work
by computing a tentative update $x_i \to x_{i+1}^{\mathrm{pred}}$ (predictor step), then using
$x_{i+1}^{\mathrm{pred}}$ to account for higher order changes in the objective function, and
finally obtaining a corrected value $x_{i+1}^{\mathrm{corr}}$ based on these changes. A simple
example illustrates the method. Assume we want to find the solution to the equation

$$f(x) = 0 \text{ where } f(x) = f_0 + a x + \tfrac{1}{2} b x^2. \tag{6.32}$$


We assume $a, b, f_0, x \in \mathbb{R}$. Exact solution of (6.32) requires taking a square root.
Let us see whether we can find an approximate method that avoids this (in general
$b$ will be an $m \times m$ matrix, so this is a worthwhile goal). The predictor corrector
approach works as follows: first solve

$$f_0 + a x = 0 \text{ and hence } x^{\mathrm{pred}} = -\frac{f_0}{a}. \tag{6.33}$$

Second, substitute $x^{\mathrm{pred}}$ into the nonlinear parts of (6.32) to obtain

$$f_0 + a x^{\mathrm{corr}} + \frac{1}{2} b \left( \frac{f_0}{a} \right)^2 = 0 \text{ and hence } x^{\mathrm{corr}} = -\frac{f_0}{a} \left( 1 + \frac{1}{2} \frac{b f_0}{a^2} \right). \tag{6.34}$$

Comparing $x^{\mathrm{pred}}$ and $x^{\mathrm{corr}}$, we see that $\frac{1}{2} \frac{b f_0}{a^2}$ is the
correction term that takes the effect of the changes in $x$ into account.

Since neither of the two values ($x^{\mathrm{pred}}$ or $x^{\mathrm{corr}}$) will give us the exact
solution to $f(x) = 0$ in just one step, it is worthwhile having a look at the errors of
both approaches:

$$f(x^{\mathrm{pred}}) = \frac{1}{2} \frac{b f_0^2}{a^2} \text{ and } f(x^{\mathrm{corr}}) = \frac{1}{2} \frac{b^2 f_0^3}{a^4} + \frac{1}{8} \frac{b^3 f_0^4}{a^6}. \tag{6.35}$$

We can check that if $\left| b f_0 / a^2 \right| < 2\sqrt{2} - 2$, the corrector estimate will be
better than the predictor one. As our initial estimate $f_0$ decreases, this will be the
case. Moreover, we can see that $f(x^{\mathrm{corr}})$ only contains terms in $f_0$ that are of
higher order than quadratic. This means that even though we did not solve the quadratic
form explicitly, we eliminated all corresponding terms.
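The error formulas in (6.35) are easy to confirm numerically. The snippet below is a minimal sketch for the scalar example (6.32); the function names and the sample coefficients $f_0 = 0.1$, $a = b = 1$ are arbitrary illustrative choices.

```python
def f(x, f0, a, b):
    """The quadratic of (6.32)."""
    return f0 + a * x + 0.5 * b * x ** 2


def predictor(f0, a):
    """Solution of the linearized equation, cf. (6.33)."""
    return -f0 / a


def corrector(f0, a, b):
    """Predictor value fed back into the quadratic term, cf. (6.34)."""
    return -(f0 / a) * (1.0 + 0.5 * b * f0 / a ** 2)


f0, a, b = 0.1, 1.0, 1.0
err_pred = f(predictor(f0, a), f0, a, b)
err_corr = f(corrector(f0, a, b), f0, a, b)

# Closed-form residuals from (6.35) for comparison.
err_pred_formula = 0.5 * b * f0 ** 2 / a ** 2
err_corr_formula = 0.5 * b ** 2 * f0 ** 3 / a ** 4 + b ** 3 * f0 ** 4 / (8.0 * a ** 6)
```

With these coefficients the predictor residual is quadratic in $f_0$ while the corrector residual is cubic, so the corrected estimate is roughly an order of magnitude closer to a root.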

The general scheme is described in Algorithm 6.5. It is based on the assumption
that $f(x + \xi)$ can be split up into

$$f(x + \xi) = f(x) + f_{\mathrm{simple}}(\xi, x) + T(\xi, x), \tag{6.36}$$
Algorithm 6.5 Predictor Corrector Method

Require: $x_0$, Precision $\epsilon$
Set $i = 0$
repeat
    Expand $f$ into $f(x_i) + f_{\mathrm{simple}}(\xi, x_i) + T(\xi, x_i)$.
    Predictor: Solve $f(x_i) + f_{\mathrm{simple}}(\xi^{\mathrm{pred}}, x_i) = 0$ for $\xi^{\mathrm{pred}}$.
    Corrector: Solve $f(x_i) + f_{\mathrm{simple}}(\xi^{\mathrm{corr}}, x_i) + T(\xi^{\mathrm{pred}}, x_i) = 0$ for $\xi^{\mathrm{corr}}$.
    $x_{i+1} = x_i + \xi^{\mathrm{corr}}$.
    $i = i + 1$.
until $|f(x_i)| \le \epsilon$
Output: $x_i$


where $f_{\mathrm{simple}}(\xi, x)$ contains the simple, possibly low order, part of $f$, and
$T(\xi, x)$ the higher order terms, such that $f_{\mathrm{simple}}(0, x) = T(0, x) = 0$. While in
the previous example we introduced higher order terms into $f$ that were not present before
($f$ is only quadratic), usually such terms will already exist anyway. Hence the corrector
step will just eliminate additional lower order terms without too much additional
error in the approximation.

We will encounter such methods for instance in the context of interior point 
algorithms (Section 6.4), where we have to solve a set of quadratic equations. 
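The loop of Algorithm 6.5 can be sketched for the scalar quadratic (6.32). The split used here is an assumption made for illustration: $f_{\mathrm{simple}}(\xi, x) = (a + b x)\,\xi$ is the linear part of the expansion around the current iterate and $T(\xi, x) = \frac{1}{2} b \xi^2$ is the remainder. The function name and default parameters are likewise hypothetical.

```python
def predictor_corrector_root(f0, a, b, x0=0.0, eps=1e-12, max_iter=50):
    """Find a root of f(x) = f0 + a x + 0.5 b x^2 in the spirit of Algorithm 6.5.

    Returns the final iterate and the number of iterations used.
    """
    def f(x):
        return f0 + a * x + 0.5 * b * x * x

    x = x0
    for i in range(max_iter):
        fx = f(x)
        if abs(fx) <= eps:
            return x, i
        slope = a + b * x                                  # f_simple(xi, x) = slope * xi
        xi_pred = -fx / slope                              # predictor: drop T entirely
        xi_corr = -(fx + 0.5 * b * xi_pred ** 2) / slope   # corrector: plug xi_pred into T
        x += xi_corr
    return x, max_iter


root, steps = predictor_corrector_root(0.1, 1.0, 1.0)
exact = -1.0 + 0.8 ** 0.5      # root of 0.5 x^2 + x + 0.1 closest to zero
```

No square root is taken inside the loop; each iteration only solves two linear equations with the same coefficient `slope`, which is the point of the method when $b$ is a matrix.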


6.3 Constrained Problems 


After this digression on unconstrained optimization problems, let us return to
constrained optimization, which makes up the main body of the problems we
will have to deal with in learning (e.g., quadratic or general convex programs for
Support Vector Machines). Typically, we have to deal with problems of type (6.6).
For convenience we repeat the problem statement:

$$\begin{array}{ll} \underset{x}{\text{minimize}} & f(x) \\ \text{subject to} & c_i(x) \le 0 \text{ for all } i \in [n]. \end{array} \tag{6.37}$$


Here $f$ and $c_i$ are convex functions and $n \in \mathbb{N}$. In some cases, we additionally
have equality constraints $e_j(x) = 0$ for some $j \in [n']$. Then the optimization problem
can be written as

$$\begin{array}{ll} \underset{x}{\text{minimize}} & f(x), \\ \text{subject to} & c_i(x) \le 0 \text{ for all } i \in [n], \\ & e_j(x) = 0 \text{ for all } j \in [n']. \end{array} \tag{6.38}$$


3. Note that it is common practice in Support Vector Machines to write $c_i$ as positivity
constraints by using concave functions. This can be fixed by a sign change, however.
Before we start minimizing $f$, we have to discuss what optimality means in this
case. Clearly $f'(x) = 0$ is too restrictive a condition. For instance, $f'$ could point
into a direction which is forbidden by the constraints $c_i$ and $e_j$. Then we could
have optimality, even though $f' \ne 0$. Let us analyze the situation in more detail.


6.3.1 Optimality Conditions 


We start with optimality conditions for optimization problems which are indepen- 
dent of their differentiability. While it is fairly straightforward to state sufficient 
optimality conditions for arbitrary functions f and c;, we will need convexity and 
“reasonably nice” constraints (see Lemma 6.23) to state necessary conditions. This 
is not a major concern, since for practical applications, the constraint qualification 
criteria are almost always satisfied, and the functions themselves are usually con- 
vex and differentiable. Much of the reasoning in this section follows [345], which 
should also be consulted for further references and detail. 

Some of the most important sufficient criteria are the Kuhn-Tucker⁴ saddle point
conditions [312]. As indicated previously, they are independent of assumptions on
convexity or differentiability of the constraints $c_i$ or objective function $f$.


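Theorem 6.21 (Kuhn-Tucker Saddle Point Condition [312, 345]) Assume an optimization
problem of the form (6.37), where $f : \mathbb{R}^m \to \mathbb{R}$ and
$c_i : \mathbb{R}^m \to \mathbb{R}$ for $i \in [n]$ are arbitrary functions, and a Lagrangian

$$L(x, \alpha) := f(x) + \sum_{i=1}^n \alpha_i c_i(x) \text{ where } \alpha_i \ge 0. \tag{6.39}$$

If a pair of variables $(\bar{x}, \bar{\alpha})$ with $\bar{x} \in \mathbb{R}^m$ and
$\bar{\alpha}_i \ge 0$ for all $i \in [n]$ exists, such that for all $x \in \mathbb{R}^m$ and
$\alpha \in [0, \infty)^n$,

$$L(\bar{x}, \alpha) \le L(\bar{x}, \bar{\alpha}) \le L(x, \bar{\alpha}) \quad \text{(Saddle Point)} \tag{6.40}$$

then $\bar{x}$ is a solution to (6.37).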


The parameters $\alpha_i$ are called Lagrange multipliers. As described in the later
chapters, they will become the coefficients in the kernel expansion in SVM.


Proof The proof follows [345]. Denote by $(\bar{x}, \bar{\alpha})$ a pair of variables
satisfying (6.40). From the first inequality it follows that

$$\sum_{i=1}^n (\alpha_i - \bar{\alpha}_i) c_i(\bar{x}) \le 0. \tag{6.41}$$

Since we are free to choose $\alpha_i \ge 0$, we can see (by setting all but one of the terms
$\alpha_i$ to $\bar{\alpha}_i$ and the remaining one to $\alpha_i = \bar{\alpha}_i + 1$) that
$c_i(\bar{x}) \le 0$ for all $i \in [n]$. This shows that $\bar{x}$ satisfies the constraints,
i.e. it is feasible.


4. An earlier version is due to Karush [283]. This is why often one uses the abbreviation 
KKT (Karush-Kuhn-Tucker) rather than KT to denote the optimality conditions. 
Additionally, by setting one of the $\alpha_i$ to 0, we see that
$\bar{\alpha}_i c_i(\bar{x}) \ge 0$. The only way to satisfy this is by having

$$\bar{\alpha}_i c_i(\bar{x}) = 0 \text{ for all } i \in [n]. \tag{6.42}$$

Eq. (6.42) is often referred to as the KKT condition [283, 312]. Finally, combining
(6.42) and $c_i(x) \le 0$ with the second inequality in (6.40) yields
$f(\bar{x}) \le f(x)$ for all feasible $x$. This proves that $\bar{x}$ is optimal. ∎
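A one-dimensional toy problem makes the saddle point condition (6.40) tangible. The example below is an illustration chosen here, not one from the text: minimize $f(x) = x^2$ subject to $c(x) = 1 - x \le 0$, whose solution is $\bar{x} = 1$ with multiplier $\bar{\alpha} = 2$. The sketch evaluates both inequalities of (6.40) on a grid.

```python
def lagrangian(x, alpha):
    """L(x, alpha) = f(x) + alpha c(x) with f(x) = x^2 and c(x) = 1 - x, cf. (6.39)."""
    return x * x + alpha * (1.0 - x)


x_bar, alpha_bar = 1.0, 2.0
saddle_value = lagrangian(x_bar, alpha_bar)

# Grids over x in [-3, 3] and alpha in [0, 5].
xs = [i / 10.0 for i in range(-30, 31)]
alphas = [i / 10.0 for i in range(0, 51)]

# First inequality of (6.40): L(x_bar, alpha) <= L(x_bar, alpha_bar) for all alpha >= 0.
left_ok = all(lagrangian(x_bar, a) <= saddle_value + 1e-12 for a in alphas)
# Second inequality of (6.40): L(x_bar, alpha_bar) <= L(x, alpha_bar) for all x.
right_ok = all(saddle_value <= lagrangian(x, alpha_bar) + 1e-12 for x in xs)

# KKT condition (6.42): the multiplier times the constraint vanishes at the solution.
kkt_product = alpha_bar * (1.0 - x_bar)
```

Here $L(\bar{x}, \alpha) = 1$ for every $\alpha$ because the constraint is active, and $L(x, \bar{\alpha}) = (x - 1)^2 + 1$, so both inequalities hold with the saddle value $f(\bar{x}) = 1$.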


We can immediately extend Theorem 6.21 to accommodate equality constraints by
splitting them into the conditions $e_j(x) \le 0$ and $e_j(x) \ge 0$. We obtain:


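Theorem 6.22 (Equality Constraints) Assume an optimization problem of the form
(6.38), where $f, c_i, e_j : \mathbb{R}^m \to \mathbb{R}$ for $i \in [n]$ and $j \in [n']$ are
arbitrary functions, and a Lagrangian

$$L(x, \alpha, \beta) := f(x) + \sum_{i=1}^n \alpha_i c_i(x) + \sum_{j=1}^{n'} \beta_j e_j(x) \text{ where } \alpha_i \ge 0 \text{ and } \beta_j \in \mathbb{R}. \tag{6.43}$$

If a set of variables $(\bar{x}, \bar{\alpha}, \bar{\beta})$ with $\bar{x} \in \mathbb{R}^m$,
$\bar{\alpha} \in [0, \infty)^n$, and $\bar{\beta} \in \mathbb{R}^{n'}$ exists such that for all
$x \in \mathbb{R}^m$, $\alpha \in [0, \infty)^n$, and $\beta \in \mathbb{R}^{n'}$,

$$L(\bar{x}, \alpha, \beta) \le L(\bar{x}, \bar{\alpha}, \bar{\beta}) \le L(x, \bar{\alpha}, \bar{\beta}), \tag{6.44}$$

then $\bar{x}$ is a solution to (6.38).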


Now we determine when the conditions of Theorem 6.21 are necessary. We 
will see that convexity and sufficiently “nice” constraints are needed for (6.40) 
to become a necessary condition. The following lemma (see [345]) describes three 
constraint qualifications, which will turn out to be exactly what we need. 


Lemma 6.23 (Constraint Qualifications) Denote by $\mathcal{X} \subset \mathbb{R}^m$ a convex set,
and by $c_1, \ldots, c_n : \mathcal{X} \to \mathbb{R}$ $n$ convex functions defining a feasible
region by

$$X := \{ x \mid x \in \mathcal{X} \text{ and } c_i(x) \le 0 \text{ for all } i \in [n] \}. \tag{6.45}$$

Then the following additional conditions on $c_i$ are connected by (i) $\Longleftrightarrow$ (ii)
and (iii) $\Longrightarrow$ (i).

(i) There exists an $x \in \mathcal{X}$ such that for all $i \in [n]$, $c_i(x) < 0$ (Slater's
condition [500]).

(ii) For all nonzero $\alpha \in [0, \infty)^n$ there exists an $x \in \mathcal{X}$ such that
$\sum_{i=1}^n \alpha_i c_i(x) < 0$ (Karlin's condition [281]).

(iii) The feasible region $X$ contains at least two distinct elements, and there exists an
$x \in X$ such that all $c_i$ are strictly convex at $x$ wrt. $X$ (Strict constraint
qualification).


The connection (i) <=> (ii) is also known as the Generalized Gordan Theorem 
[164]. The proof can be skipped if necessary. We need an auxiliary lemma which 
we state without proof (see [345, 435] for details). 
Figure 6.7 Two hyperplanes (and their normal vectors) separating the convex hull of a
finite set of points from the origin.


Lemma 6.24 (Separating Hyperplane Theorem) Denote by $X \subset \mathbb{R}^m$ a convex set not
containing the origin 0. Then there exists a hyperplane with normal vector
$\alpha \in \mathbb{R}^m$ such that $\alpha^\top x \ge 0$ for all $x \in X$.


See also Figure 6.7. 


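Proof of Lemma 6.23. We prove {(i) $\Longleftrightarrow$ (ii)} by showing
{(i) $\Longrightarrow$ (ii)} and {not (i) $\Longrightarrow$ not (ii)}.

(i) $\Longrightarrow$ (ii) For a point $x \in \mathcal{X}$ with $c_i(x) < 0$ for all $i \in [n]$,
we have that $\sum_{i=1}^n \alpha_i c_i(x) \ge 0$ implies $\alpha_i = 0$ for all $i$. Hence the
sum is strictly negative for every nonzero $\alpha \in [0, \infty)^n$.

not (i) $\Longrightarrow$ not (ii) Assume that there is no $x$ with $c_i(x) < 0$ for all
$i \in [n]$. Hence the set

$$\Gamma := \{ \gamma \mid \gamma \in \mathbb{R}^n \text{ and there exists some } x \in \mathcal{X} \text{ with } \gamma_i > c_i(x) \text{ for all } i \in [n] \} \tag{6.46}$$

is convex and does not contain the origin. The latter follows directly from the
assumption. For the former take $\gamma, \gamma' \in \Gamma$ and $\lambda \in (0, 1)$ to obtain

$$\lambda \gamma_i + (1 - \lambda) \gamma_i' > \lambda c_i(x) + (1 - \lambda) c_i(x') \ge c_i(\lambda x + (1 - \lambda) x'). \tag{6.47}$$

Now by Lemma 6.24, there exists some $\alpha \in \mathbb{R}^n$ such that $\alpha^\top \gamma \ge 0$
and $\|\alpha\|^2 = 1$ for all $\gamma \in \Gamma$. Since each of the $\gamma_i$ for
$\gamma \in \Gamma$ can be arbitrarily large (with respect to the other coordinates), we
conclude $\alpha_i \ge 0$ for all $i \in [n]$.

Denote by $\delta := \inf_{x \in \mathcal{X}} \sum_{i=1}^n \alpha_i c_i(x)$ and by
$\delta' := \inf_{\gamma \in \Gamma} \alpha^\top \gamma$. One can see that by construction
$\delta = \delta'$. By Lemma 6.24 $\alpha$ was chosen such that $\delta' \ge 0$, and hence
$\delta \ge 0$. This contradicts (ii), however, since it implies the existence of a suitable
$\alpha$ with $\sum_{i=1}^n \alpha_i c_i(x) \ge 0$ for all $x$.

(iii) $\Longrightarrow$ (i) Since $X$ is convex we get for all $c_i$ and for any
$\lambda \in (0, 1)$:

$$\lambda x + (1 - \lambda) x' \in X \text{ and } 0 \ge \lambda c_i(x) + (1 - \lambda) c_i(x') > c_i(\lambda x + (1 - \lambda) x'). \tag{6.48}$$

This shows that $\lambda x + (1 - \lambda) x'$ satisfies (i) and we are done. ∎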


We proved Lemma 6.23 as it provides us with a set of constraint qualifications 
(conditions on the constraints) that allow us to determine cases where the KKT 
saddle point conditions are both necessary and sufficient. This is important, since 
we will use the KKT conditions to transform optimization problems into their 
duals, and solve the latter numerically. For this approach to be valid, however, we 
must ensure that we do not change the solvability of the optimization problem. 
Theorem 6.25 (Necessary KKT Conditions [312, 553, 281]) Under the assumptions
and definitions of Theorem 6.21 with the additional assumption that $f$ and $c_i$ are convex
on the convex set $\mathcal{X} \subset \mathbb{R}^m$ (containing the set of feasible solutions as a
subset) and that $c_i$ satisfy one of the constraint qualifications of Lemma 6.23, the
saddle point criterion (6.40) is necessary for optimality.


Proof Denote by $\bar{x}$ the solution to (6.37), and by $X'$ the set

$$X' := X \cap \{ x \mid x \in \mathcal{X} \text{ with } f(x) - f(\bar{x}) \le 0 \text{ and } c_i(x) \le 0 \text{ for all } i \in [n] \}. \tag{6.49}$$

By construction $\bar{x} \in X'$. Furthermore, there exists no $x' \in X'$ such that all
inequality constraints including $f(x) - f(\bar{x})$ are satisfied as strict inequalities
(otherwise $\bar{x}$ would not be optimal). In other words, $X'$ violates Slater's
condition (i) of Lemma 6.23 (where both $(f(x) - f(\bar{x}))$ and $c(x)$ together play the
role of $c_i(x)$), and thus also Karlin's condition (ii). This means that there exists a
nonzero vector $(\bar{\alpha}_0, \bar{\alpha}) \in \mathbb{R}^{n+1}$ with nonnegative entries such
that

$$\bar{\alpha}_0 (f(x) - f(\bar{x})) + \sum_{i=1}^n \bar{\alpha}_i c_i(x) \ge 0 \text{ for all } x \in \mathcal{X}. \tag{6.50}$$

In particular, for $x = \bar{x}$ we get $\sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x}) \ge 0$. In
addition, since $\bar{x}$ is a solution to (6.37), we have $c_i(\bar{x}) \le 0$. Hence
$\sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x}) = 0$. This allows us to rewrite (6.50) as

$$\bar{\alpha}_0 f(x) + \sum_{i=1}^n \bar{\alpha}_i c_i(x) \ge \bar{\alpha}_0 f(\bar{x}) + \sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x}). \tag{6.51}$$

This looks almost like the second inequality of (6.40), except for the $\bar{\alpha}_0$ term
(which we will return to later). Let us now turn to the first inequality.

Again, since $c_i(\bar{x}) \le 0$ we have $\sum_{i=1}^n \alpha_i c_i(\bar{x}) \le 0$ for all
$\alpha_i \ge 0$. Adding $\bar{\alpha}_0 f(\bar{x})$ on both sides of the inequality and
$\sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x})$ (which is zero) on the rhs yields

$$\bar{\alpha}_0 f(\bar{x}) + \sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x}) \ge \bar{\alpha}_0 f(\bar{x}) + \sum_{i=1}^n \alpha_i c_i(\bar{x}). \tag{6.52}$$

This is almost all we need for the first inequality of (6.40).⁵ If $\bar{\alpha}_0 > 0$ we can
divide (6.51) and (6.52) by $\bar{\alpha}_0$ and we are done.

When $\bar{\alpha}_0 = 0$, then this implies the existence of a nonzero
$\bar{\alpha} \in \mathbb{R}^n$ with nonnegative entries satisfying
$\sum_{i=1}^n \bar{\alpha}_i c_i(x) \ge 0$ for all $x \in \mathcal{X}$. This contradicts Karlin's
constraint qualification condition (ii), which allows us to rule out this case. ∎


6.3.2 Duality and KKT-Gap 


Now that we have formulated necessary and sufficient optimality conditions (The- 
orem 6.21 and 6.25) under quite general circumstances, let us put them to practical 


5. The two inequalities (6.51) and (6.52) are also known as the Fritz-John saddle point nec- 
essary optimality conditions [269], which play a similar role as the saddle point conditions 
for the Lagrangian (6.39) of Theorem 6.21. 
use for convex differentiable optimization problems. We first derive a more practically
useful form of Theorem 6.21. Our reasoning is as follows: eq. (6.40) implies
that $L$ has a saddle point at $(\bar{x}, \bar{\alpha})$. Hence, all we have to do is write
the saddle point conditions in the form of derivatives.


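Theorem 6.26 (KKT for Differentiable Convex Problems [312]) A solution to the
optimization problem (6.37) with convex, differentiable $f, c_i$ is given by $\bar{x}$, if
there exists some $\bar{\alpha} \in \mathbb{R}^n$ with $\bar{\alpha}_i \ge 0$ for all $i \in [n]$
such that the following conditions are satisfied:

$$\partial_x L(\bar{x}, \bar{\alpha}) = \partial_x f(\bar{x}) + \sum_{i=1}^n \bar{\alpha}_i \partial_x c_i(\bar{x}) = 0 \quad \text{(Saddle Point in } \bar{x}\text{)}, \tag{6.53}$$

$$\partial_{\alpha_i} L(\bar{x}, \bar{\alpha}) = c_i(\bar{x}) \le 0 \quad \text{(Saddle Point in } \bar{\alpha}\text{)}, \tag{6.54}$$

$$\sum_{i=1}^n \bar{\alpha}_i c_i(\bar{x}) = 0 \quad \text{(Vanishing KKT-Gap)}. \tag{6.55}$$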

Proof The easiest way to prove Theorem 6.26 is to show that for any $x \in X$, we
have $f(x) - f(\bar{x}) \ge 0$. Due to convexity we may linearize and obtain

$$f(x) - f(\bar{x}) \ge (\partial_x f(\bar{x}))^\top (x - \bar{x}) \tag{6.56}$$
$$= -\sum_{i=1}^n \bar{\alpha}_i (\partial_x c_i(\bar{x}))^\top (x - \bar{x}) \tag{6.57}$$
$$\ge -\sum_{i=1}^n \bar{\alpha}_i (c_i(x) - c_i(\bar{x})) \tag{6.58}$$
$$= -\sum_{i=1}^n \bar{\alpha}_i c_i(x) \ge 0. \tag{6.59}$$

Here we used the convexity and differentiability of $f$ to arrive at the rhs of (6.56),
and those of the $c_i$ to arrive at (6.58). To obtain (6.57) we exploited the fact that at
the saddle point $\partial_x f(\bar{x})$ can be replaced by the corresponding expansion in
$\partial_x c_i(\bar{x})$; thus we used (6.53). Finally, for (6.59) we used the fact that the
KKT gap vanishes at the optimum (6.55) and that the constraints are satisfied by the
feasible point $x$. ∎


In other words, we may solve a convex optimization problem by finding
$(\bar{x}, \bar{\alpha})$ that satisfy the conditions of Theorem 6.26. Moreover, these
conditions, together with the constraint qualifications of Lemma 6.23, ensure necessity.

Note that we transformed the problem of minimizing functions into one of 
solving a set of equations, for which several numerical tools are readily available. 
This is exactly how interior point methods work (see Section 6.4 for details on 
how to implement them). Necessary conditions on the constraints similar to those 
discussed previously can also be formulated (see [345] for a detailed discussion). 
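The conditions (6.53) to (6.55) can be turned into a simple residual check for a candidate pair $(\bar{x}, \bar{\alpha})$. The sketch below is an illustration under stated assumptions: the helper `kkt_residuals` and the two-dimensional example (minimize $(x_1 - 2)^2 + (x_2 - 1)^2$ subject to $x_1 + x_2 - 2 \le 0$) are hypothetical and not taken from the text.

```python
def kkt_residuals(grad_f, cons, grad_cons, x, alpha):
    """Return (stationarity, infeasibility, gap) for the conditions (6.53)-(6.55)."""
    # (6.53): gradient of the Lagrangian with respect to x.
    stat = list(grad_f(x))
    for a, gc in zip(alpha, grad_cons):
        stat = [s + a * gi for s, gi in zip(stat, gc(x))]
    stationarity = max(abs(s) for s in stat)
    # (6.54): largest constraint violation (zero if x is feasible).
    values = [c(x) for c in cons]
    infeasibility = max(0.0, max(values))
    # (6.55): KKT gap sum_i alpha_i c_i(x).
    gap = abs(sum(a * v for a, v in zip(alpha, values)))
    return stationarity, infeasibility, gap


# Hypothetical problem: project the point (2, 1) onto the half-plane x1 + x2 <= 2.
grad_f = lambda x: [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]
cons = [lambda x: x[0] + x[1] - 2.0]
grad_cons = [lambda x: [1.0, 1.0]]

# Candidate solution and multiplier.
optimal = kkt_residuals(grad_f, cons, grad_cons, [1.5, 0.5], [1.0])
# A feasible point on the boundary that is not optimal fails stationarity.
suboptimal = kkt_residuals(grad_f, cons, grad_cons, [1.0, 1.0], [1.0])
```

All three residuals vanish at $(1.5, 0.5)$ with $\bar{\alpha} = 1$, whereas the boundary point $(1, 1)$ is feasible with zero gap but violates (6.53).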

The other consequence of Theorem 6.26, or rather of the definition of the Lagrangian
$L(x, \alpha)$, is that we may bound $f(\bar{x}) = L(\bar{x}, \bar{\alpha})$ from above and
below without explicit knowledge of $f(\bar{x})$.


Theorem 6.27 (KKT-Gap) Assume an optimization problem of type (6.37), where both $f$
and $c_i$ are convex and differentiable. Denote by $\bar{x}$ its solution. Then for any set
of variables $(x, \alpha)$ with $\alpha_i \ge 0$ for all $i \in [n]$ satisfying

$$\partial_x L(x, \alpha) = 0, \tag{6.60}$$
$$\partial_{\alpha_i} L(x, \alpha) \le 0 \text{ for all } i \in [n], \tag{6.61}$$

we have

$$f(x) \ge f(\bar{x}) \ge f(x) + \sum_{i=1}^n \alpha_i c_i(x). \tag{6.62}$$


Strictly speaking, we only need differentiability of $f$ and $c_i$ at $\bar{x}$. However,
since $\bar{x}$ is only known after the optimization problem has been solved, this is not a
very useful condition.


Proof The first part of (6.62) follows from the fact that $x \in X$, so that $x$ satisfies
the constraints. Next note that $L(\bar{x}, \bar{\alpha}) = f(\bar{x})$ where
$(\bar{x}, \bar{\alpha})$ denotes the saddle point of $L$. For the second part note that due
to the saddle point condition (6.40), we have for any $\alpha$ with $\alpha_i \ge 0$,

$$f(\bar{x}) = L(\bar{x}, \bar{\alpha}) \ge L(\bar{x}, \alpha) \ge \inf_{x' \in \mathcal{X}} L(x', \alpha). \tag{6.63}$$

The function $L(x', \alpha)$ is convex in $x'$ since both $f$ and the constraints $c_i$ are
convex and all $\alpha_i \ge 0$. Therefore (6.60) implies that $x$ minimizes $L(x', \alpha)$.
This proves the second part of (6.63), which in turn proves the second inequality of
(6.62). ∎


Hence, no matter what algorithm we are using in order to solve (6.37), we may
always use (6.62) to assess the proximity of the current set of parameters to the
solution. Clearly, the relative size of $\sum_{i=1}^n \alpha_i c_i(x)$ provides a useful
stopping criterion for convex optimization algorithms.
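The two-sided bound (6.62) can be illustrated on a one-dimensional toy problem chosen here for simplicity (it is not from the text): minimize $f(x) = x^2$ subject to $c(x) = 1 - x \le 0$, with optimal value $f(\bar{x}) = 1$. For a given $\alpha$, condition (6.60) reads $2x - \alpha = 0$, so $x = \alpha / 2$, and (6.61) requires $\alpha \ge 2$. The function name `kkt_gap_bounds` is an assumption.

```python
def kkt_gap_bounds(alpha):
    """Lower and upper bounds on f(x_bar) from (6.62) for f(x) = x^2, c(x) = 1 - x."""
    x = alpha / 2.0                  # solves d/dx [x^2 + alpha (1 - x)] = 0, cf. (6.60)
    c = 1.0 - x
    if c > 0.0:                      # (6.61) violated: x is not feasible
        raise ValueError("alpha must be at least 2 for x = alpha / 2 to be feasible")
    upper = x * x                    # f(x)
    lower = upper + alpha * c        # f(x) + alpha c(x)
    return lower, upper


f_opt = 1.0
bounds = {alpha: kkt_gap_bounds(alpha) for alpha in (3.0, 2.5, 2.0)}
# The KKT gap is the width of the bracket, -alpha * c(x).
gaps = {alpha: upper - lower for alpha, (lower, upper) in bounds.items()}
```

As $\alpha$ approaches the optimal multiplier $\bar{\alpha} = 2$ the bracket closes around $f(\bar{x}) = 1$, which is what makes the gap usable as a stopping criterion.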

Finally, another concept that is useful when dealing with optimization problems
is that of duality. This means that for the primal minimization problem considered
so far, which is expressed in terms of $x$, we can find a dual maximization problem
in terms of $\alpha$ by computing the saddle point of the Lagrangian $L(x, \alpha)$, and
eliminating the primal variables $x$. We thus obtain the following dual maximization
problem from (6.37):

$$\begin{array}{ll} \text{maximize} & L(x, \alpha) = f(x) + \sum_{i=1}^n \alpha_i c_i(x), \\ \text{where} & (x, \alpha) \in Y := \left\{ (x, \alpha) \mid x \in \mathcal{X},\ \alpha_i \ge 0 \text{ for all } i \in [n], \text{ and } \partial_x L(x, \alpha) = 0 \right\}. \end{array} \tag{6.64}$$

We state without proof a theorem guaranteeing the existence of a solution to (6.64). 


Theorem 6.28 (Wolfe [607]) Recall the definition of $X$ (6.45) and of the optimization
problem (6.37). Under the assumptions that $\mathcal{X}$ is an open set, $X$ satisfies one of
the constraint qualifications of Lemma 6.23, and $f, c_i$ are all convex and differentiable,
there exists an $\bar{\alpha} \in \mathbb{R}^n$ such that $(\bar{x}, \bar{\alpha})$ solves the dual
optimization problem (6.64) and in addition $L(\bar{x}, \bar{\alpha}) = f(\bar{x})$.

In order to prove Theorem 6.28 we first have to show that some $(\bar{x}, \bar{\alpha})$
exists satisfying the KKT conditions, and then use the fact that the KKT-Gap at the saddle
point vanishes.


6.3.3 Linear and Quadratic Programs 


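Let us analyze the notions of primal and dual objective functions in more detail by
looking at linear and quadratic programs. We begin with a simple linear setting.⁶

$$\begin{array}{ll} \underset{x}{\text{minimize}} & c^\top x \\ \text{subject to} & Ax + d \le 0 \end{array} \tag{6.65}$$

where $c, x \in \mathbb{R}^m$, $d \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times m}$, and where
$Ax + d \le 0$ is a shorthand for $\sum_{j=1}^m A_{ij} x_j + d_i \le 0$ for all $i \in [n]$.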

It is far from clear that (6.65) always has a solution, or indeed a minimum. For
instance, the set of $x$ satisfying $Ax + d \le 0$ might be empty, or it might contain rays
going to infinity in directions where $c^\top x$ keeps decreasing. Before we deal with this
issue in more detail, let us compute the sufficient KKT conditions for optimality,
and the dual of (6.65). We may use Theorem 6.26 since (6.65) is clearly differentiable and
convex. In particular we obtain:


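Theorem 6.29 (KKT Conditions for Linear Programs) A sufficient condition for a
solution to the linear program (6.65) to exist is that the following four conditions are
satisfied for some $(x, \alpha) \in \mathbb{R}^{m+n}$ where $\alpha \ge 0$:

$$\partial_x L(x, \alpha) = \partial_x \left[ c^\top x + \alpha^\top (Ax + d) \right] = A^\top \alpha + c = 0, \tag{6.66}$$
$$\partial_\alpha L(x, \alpha) = Ax + d \le 0, \tag{6.67}$$
$$\alpha^\top (Ax + d) = 0, \tag{6.68}$$
$$\alpha \ge 0. \tag{6.69}$$

Then the minimum is given by $c^\top x$.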


Note that, depending on the choice of $A$ and $d$, there may not always exist an $x$
such that $Ax + d \le 0$, in which case the constraint does not satisfy the conditions
of Lemma 6.23. In this situation, no solution exists for (6.65). If a feasible $x$ exists,
however, then (projections onto lower dimensional subspaces aside) the constraint
qualifications are satisfied on the feasible set, and the conditions above are necessary.
See [334, 345, 555] for details.


6. Note that we encounter a small clash of notation in (6.65), since c is used as a symbol 
for the loss function in the remainder of the book. This inconvenience is outweighed, 
however, by the advantage of consistency with the standard literature (e.g., [345, 45, 555]) 
on optimization. The latter will allow the reader to read up on the subject without any need 
for cumbersome notational changes. 
Next we may compute Wolfe's dual optimization problem by substituting (6.66)
into $L(x, \alpha)$. Consequently, the primal variables $x$ vanish, and we obtain a
maximization problem in terms of $\alpha$ only:

$$\begin{array}{ll} \underset{\alpha}{\text{maximize}} & d^\top \alpha, \\ \text{subject to} & A^\top \alpha + c = 0 \text{ and } \alpha \ge 0. \end{array} \tag{6.70}$$


Note that the number of variables and constraints has changed: we started with
$m$ variables and $n$ constraints. Now we have $n$ variables together with $m$ equality
constraints and $n$ inequality constraints. While it is not yet completely obvious in
the linear case, dualization may render optimization problems more amenable to
numerical solution (the contrary may be true as well, though).

What happens if a solution $\bar{x}$ to the primal problem (6.65) exists? In this case we
know (since the KKT conditions of Theorem 6.29 are necessary and sufficient) that
there must be an $\bar{\alpha}$ solving the dual problem, since $L(x, \alpha)$ has a saddle
point at $(\bar{x}, \bar{\alpha})$.

If no feasible point of the primal problem exists, there must exist, by (a small
modification of) Lemma 6.23, some $\alpha \in \mathbb{R}^n$ with $\alpha \ge 0$ and at least one
$\alpha_i > 0$ such that $\alpha^\top (Ax + d) > 0$ for all $x$. This means that for all $x$,
the Lagrangian $L(x, \alpha)$ is unbounded from above, since we can make
$\alpha^\top (Ax + d)$ arbitrarily large by rescaling $\alpha$. Hence the dual optimization
problem is unbounded. Using analogous reasoning, if the primal problem is unbounded, the
dual problem is infeasible.
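A small worked instance shows how the primal (6.65), the dual (6.70), and the conditions (6.66) to (6.69) fit together. The sketch below is illustrative only: the function name `lp_kkt_check` and the example data (two variables, three constraints) are assumptions, and the candidate solutions were found by hand rather than by a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lp_kkt_check(A, c, d, x, alpha, tol=1e-12):
    """Check (6.66)-(6.69) for a candidate primal x and dual alpha."""
    n, m = len(A), len(c)
    slack = [dot(A[i], x) + d[i] for i in range(n)]                  # Ax + d
    stationarity = [sum(A[i][j] * alpha[i] for i in range(n)) + c[j]
                    for j in range(m)]                               # A^T alpha + c
    return {
        "stationary": all(abs(s) <= tol for s in stationarity),     # (6.66)
        "primal_feasible": all(s <= tol for s in slack),             # (6.67)
        "complementary": abs(dot(alpha, slack)) <= tol,              # (6.68)
        "dual_feasible": all(a >= -tol for a in alpha),              # (6.69)
        "primal_value": dot(c, x),                                   # c^T x
        "dual_value": dot(d, alpha),                                 # d^T alpha
    }


# Hypothetical LP: minimize x1 + 2 x2 subject to x1 >= 1, x2 >= 2, x1 + x2 >= 4,
# written in the form Ax + d <= 0.
A = [[-1.0, 0.0],
     [0.0, -1.0],
     [-1.0, -1.0]]
d = [1.0, 2.0, 4.0]
c = [1.0, 2.0]

report = lp_kkt_check(A, c, d, x=[2.0, 2.0], alpha=[0.0, 1.0, 1.0])
# Another feasible primal point has a larger objective value.
other_value = dot(c, [1.0, 3.0])
```

The first constraint is inactive at $x = (2, 2)$ and carries a zero multiplier, while the two active constraints have positive multipliers; the primal and dual objective values coincide, as stated for the case where a solution exists.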

Let us see what happens if we dualize (6.70) one more time. First we need
more Lagrange multipliers, since we have two sets of constraints. The equality
constraints can be taken care of by an unbounded variable $x'$ (see Theorem 6.22
for how to deal with equalities). For the inequalities $\alpha \ge 0$, we introduce a second
Lagrange multiplier $y \in \mathbb{R}^n$. After some calculations and resubstitution into the
corresponding Lagrangian, we get

$$\begin{array}{ll} \underset{x', y}{\text{minimize}} & c^\top x' \\ \text{subject to} & Ax' + d + y = 0 \text{ and } y \ge 0. \end{array} \tag{6.71}$$

We can remove $y \ge 0$ from the set of variables by transforming $Ax' + d + y = 0$ into
$Ax' + d \le 0$; thus we recover the primal optimization problem (6.65).⁷

The following theorem gives an overview of the transformations and relations 
between primal and dual problems (see also Table 6.2). Although we only derived 
these relations for linear programs, they also hold for other convex differentiable 
settings [45]. 


Theorem 6.30 (Trichotomy) For linear and convex quadratic programs exactly one of
the following three alternatives must hold:

1. Both feasible regions are empty.

2. Exactly one feasible region is empty, in which case the objective function of the other
problem is unbounded in the direction of optimization.

3. Both feasible regions are nonempty, in which case both problems have solutions and
their extrema are equal.


Table 6.2 Connections between primal and dual linear and convex quadratic programs.

Primal Optimization Problem (in x)     | Dual Optimization Problem (in α)
---------------------------------------|---------------------------------------
solution exists                        | solution exists and extrema are equal
no solution exists                     | maximization problem has unbounded objective from above or is infeasible
minimization problem has unbounded objective from below or is infeasible | no solution exists
inequality constraint                  | inequality constraint
equality constraint                    | free variable
free variable                          | equality constraint


7. This finding is useful if we have to dualize twice in some optimization settings (see
Chapter 10), since then we will be able to recover some of the primal variables without
further calculations if the optimization algorithm provides us with both primal and dual
variables.


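We conclude this section by stating primal and dual optimization problems, and
the sufficient KKT conditions for convex quadratic optimization problems. To
keep matters simple we only consider the following type of optimization problem
(other problems can be rewritten in the same form; see Problem 6.11 for details):

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && \tfrac{1}{2} x^\top K x + c^\top x, \\
&\text{subject to} && Ax + d \le 0.
\end{aligned}
\tag{6.72}
$$

Here $K$ is a strictly positive definite matrix, $x, c \in \mathbb{R}^m$, $A \in \mathbb{R}^{n \times m}$, and $d \in \mathbb{R}^n$. Note
that this is clearly a differentiable convex optimization problem. To introduce a
Lagrangian we need corresponding multipliers $\alpha \in \mathbb{R}^n$ with $\alpha \ge 0$. We obtain

$$
L(x, \alpha) = \tfrac{1}{2} x^\top K x + c^\top x + \alpha^\top (Ax + d). \tag{6.73}
$$

Next we may apply Theorem 6.26 to obtain the KKT conditions. They can be stated
in analogy to (6.66)-(6.68) as

$$
\begin{aligned}
\partial_x L(\bar{x}, \bar{\alpha}) = \partial_x \left[ c^\top \bar{x} + \bar{\alpha}^\top (A\bar{x} + d) + \tfrac{1}{2} \bar{x}^\top K \bar{x} \right] = K\bar{x} + A^\top \bar{\alpha} + c &= 0, && (6.74) \\
\partial_\alpha L(\bar{x}, \bar{\alpha}) = A\bar{x} + d &\le 0, && (6.75) \\
\bar{\alpha}^\top (A\bar{x} + d) &= 0, && (6.76) \\
\bar{\alpha} &\ge 0. && (6.77)
\end{aligned}
$$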
In order to compute the dual of (6.72), we have to eliminate $x$ from (6.73) and write
it as a function of $\alpha$. We obtain

$$
\begin{aligned}
L(x, \alpha) &= -\tfrac{1}{2} x^\top K x + \alpha^\top d && (6.78) \\
&= -\tfrac{1}{2} \alpha^\top A K^{-1} A^\top \alpha + \left[ d - A K^{-1} c \right]^\top \alpha - \tfrac{1}{2} c^\top K^{-1} c. && (6.79)
\end{aligned}
$$

In (6.78) we used (6.74) and (6.76) directly, whereas in order to eliminate $x$
completely in (6.79) we solved (6.74) for $x = -K^{-1}(c + A^\top \alpha)$. Ignoring constant terms
this leads to the dual quadratic optimization problem,

$$
\begin{aligned}
&\underset{\alpha}{\text{minimize}} && \tfrac{1}{2} \alpha^\top A K^{-1} A^\top \alpha + \left[ A K^{-1} c - d \right]^\top \alpha, \\
&\text{subject to} && \alpha \ge 0.
\end{aligned}
\tag{6.80}
$$


The surprising fact about the dual problem (6.80) is that the constraints become 
significantly simpler than in the primal (6.72). Furthermore, if n < m, we also 
obtain a more compact representation of the quadratic term. 

There is one aspect in which (6.80) differs from its linear counterpart (6.70): if 
we dualize (6.80) again, we do not recover (6.72) but rather a problem very similar 
in structure to (6.80). Dualizing (6.80) twice, however, we recover the dual itself 
(Problem 6.13 deals with this matter in more detail). 
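
Because the only constraint in (6.80) is $\alpha \ge 0$, even a very simple method can be applied to the dual. The following NumPy sketch uses projected gradient descent, which is a technique swapped in here for illustration and not one discussed in the text; the function name `solve_dual_qp` and the toy data are likewise assumptions. It recovers $x$ from (6.74) and checks that primal and dual objective values agree.

```python
import numpy as np

def solve_dual_qp(K, c, A, d, iters=500):
    """Projected gradient descent on the dual (6.80); returns (x, alpha)."""
    K_inv = np.linalg.inv(K)
    Q = A @ K_inv @ A.T                        # quadratic term of (6.80)
    r = A @ K_inv @ c - d                      # linear term of (6.80)
    step = 1.0 / np.linalg.eigvalsh(Q).max()   # 1 / Lipschitz constant of the gradient
    alpha = np.zeros(A.shape[0])
    for _ in range(iters):
        grad = Q @ alpha + r
        alpha = np.maximum(0.0, alpha - step * grad)  # project onto alpha >= 0
    x = -K_inv @ (c + A.T @ alpha)             # recover x by solving (6.74)
    return x, alpha

# Toy problem: minimize 0.5*||x||^2 - x1 - x2 subject to x1 + x2 - 1 <= 0.
K = np.eye(2)
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
d = np.array([-1.0])

x, alpha = solve_dual_qp(K, c, A, d)
primal = 0.5 * x @ K @ x + c @ x
dual = primal + alpha @ (A @ x + d)            # value of the Lagrangian (6.73)
print(x, alpha, primal, dual)
```

For this toy problem the constraint is active at the solution, so $x = (0.5, 0.5)$, $\alpha = 0.5$, and both objective values equal $-0.75$.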


6.4 Interior Point Methods 


Let us now have a look at simple, yet efficient optimization algorithms for con- 
strained problems: interior point methods. 

An interior point is a pair of variables $(x, \alpha)$ that satisfies both primal and dual
constraints. As already mentioned before, finding a set of vectors $(\bar{x}, \bar{\alpha})$ that satisfy
the KKT conditions is sufficient to obtain a solution in $x$. Hence, all we have to do
is devise an algorithm which solves (6.74)-(6.77), for instance, if we want to solve
a quadratic program. We will focus on the quadratic case; the changes required
for linear programs merely involve the removal of some variables, simplifying the
equations. See Problem 6.14 and [555, 517] for details.


6.4.1 Sufficient Conditions for a Solution 
We need a slight modification of (6.74)-(6.77) in order to achieve our goal: rather
than the inequality (6.75), we are better off with an equality and a positivity
constraint for an additional variable, i.e. we transform $Ax + d \le 0$ into $Ax + d + \xi = 0$,
where $\xi \ge 0$. Hence we arrive at the following system of equations:

$$
\begin{aligned}
Kx + A^\top \alpha + c &= 0 && \text{(Dual Feasibility)}, \\
Ax + d + \xi &= 0 && \text{(Primal Feasibility)}, \\
\alpha^\top \xi &= 0, \\
\alpha, \xi &\ge 0.
\end{aligned}
\tag{6.81}
$$


Let us analyze the equations in more detail. We have three sets of variables: $x$, $\alpha$, $\xi$.
To determine the latter, we have an equal number of equations plus the positivity
constraints on $\alpha, \xi$. While the first two equations are linear and thus amenable to
solution, e.g., by matrix inversion, the third equality $\alpha^\top \xi = 0$ has a small defect:
given one variable, say $\alpha$, we cannot solve it for $\xi$ or vice versa. Furthermore, the
last two constraints are not very informative either.

We use a primal-dual path-following algorithm, as proposed in [556], to solve
this problem. Rather than requiring $\alpha^\top \xi = 0$ we modify it to become $\alpha_i \xi_i = \mu > 0$
for all $i \in [n]$, solve (6.81) for a given $\mu$, and decrease $\mu$ to 0 as we go. The
advantage of this strategy is that we may use a Newton-type predictor corrector
algorithm (see Section 6.2.5) to update the parameters $x, \alpha, \xi$, which exhibits the
fast convergence of a second order method.


6.4.2 Solving the Equations 
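For the moment, assume that we have suitable initial values of $x, \alpha, \xi$, and $\mu$
with $\alpha, \xi > 0$. Linearization of the first three equations of (6.81), together with
the modification $\alpha_i \xi_i = \mu$, yields (we expand $x$ into $x + \Delta x$, etc.):

$$
\begin{aligned}
K \Delta x + A^\top \Delta\alpha &= -Kx - A^\top \alpha - c && =: \rho_p, \\
A \Delta x + \Delta\xi &= -Ax - d - \xi && =: \rho_d, \\
\alpha_i^{-1} \xi_i \Delta\alpha_i + \Delta\xi_i &= \mu \alpha_i^{-1} - \xi_i - \alpha_i^{-1} \Delta\alpha_i \Delta\xi_i && =: \rho_{\mathrm{KKT},i} \quad \text{for all } i.
\end{aligned}
\tag{6.82}
$$

Next we solve for $\Delta\xi_i$ to obtain what is commonly referred to as the reduced KKT
system. For convenience we use $D := \operatorname{diag}(\alpha_1^{-1}\xi_1, \ldots, \alpha_n^{-1}\xi_n)$ as a shorthand;

$$
\begin{bmatrix} K & A^\top \\ A & -D \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\alpha \end{bmatrix}
=
\begin{bmatrix} \rho_p \\ \rho_d - \rho_{\mathrm{KKT}} \end{bmatrix}.
\tag{6.83}
$$

We apply a predictor-corrector method as in Section 6.2.5. The resulting matrix of
the linear system in (6.83) is indefinite but of full rank, and we can solve (6.83) for
$(\Delta x_{\mathrm{Pred}}, \Delta\alpha_{\mathrm{Pred}})$ by explicitly pivoting for individual entries (for instance, solve for
$\Delta x$ first and then substitute the result into the second equality to obtain $\Delta\alpha$).

This gives us the predictor part of the solution. Next we have to correct for the
linearization, which is conveniently achieved by updating $\rho_{\mathrm{KKT}}$ and solving (6.83)
again to obtain the corrector values $(\Delta x_{\mathrm{Corr}}, \Delta\alpha_{\mathrm{Corr}})$. The value of $\Delta\xi$ is then obtained
from (6.82).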
Next, we have to make sure that the updates in $\alpha, \xi$ do not cause the estimates
to violate their positivity constraints. This is done by shrinking the length of
$(\Delta x, \Delta\alpha, \Delta\xi)$ by some factor $\lambda > 0$, such that

$$
\min\left( \frac{\alpha_1 + \lambda \Delta\alpha_1}{\alpha_1}, \ldots, \frac{\alpha_n + \lambda \Delta\alpha_n}{\alpha_n},
\frac{\xi_1 + \lambda \Delta\xi_1}{\xi_1}, \ldots, \frac{\xi_n + \lambda \Delta\xi_n}{\xi_n} \right) \ge \varepsilon.
\tag{6.84}
$$

Of course, only the negative $\Delta$ terms pose a problem, since they lead the parameter
values closer to 0, which may lead them into conflict with the positivity constraints.
Typically [556, 502], we choose $\varepsilon = 0.05$. In other words, the solution will
not approach the boundaries in $\alpha, \xi$ by more than 95%. See Problem 6.15 for a
formula to compute $\lambda$.
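
The rule (6.84) translates into a few lines of code. The following pure-Python sketch computes the largest admissible $\lambda \le 1$ using the closed form stated in Problem 6.15; the function name `step_length` and the sample numbers are illustrative assumptions.

```python
def step_length(alpha, xi, d_alpha, d_xi, eps=0.05):
    """Largest lambda in (0, 1] such that every alpha_i and xi_i keeps at
    least a fraction eps of its current value, cf. (6.84)."""
    ratios = [da / a for da, a in zip(d_alpha, alpha)]
    ratios += [dx / x for dx, x in zip(d_xi, xi)]
    # Only negative ratios can violate (6.84); dividing by (eps - 1) < 0
    # turns the most negative ratio into the largest lower bound on 1/lambda.
    inv_lambda = max(1.0, min(ratios) / (eps - 1.0))
    return 1.0 / inv_lambda

# The first component of alpha would become negative with a full step.
lam = step_length(alpha=[1.0, 2.0], xi=[0.5, 1.0],
                  d_alpha=[-2.0, 1.0], d_xi=[0.1, -0.5])
print(lam)
```

Here the binding ratio is $\Delta\alpha_1 / \alpha_1 = -2$, so $\lambda = 0.95 / 2 = 0.475$ and $\alpha_1$ shrinks to exactly 5% of its previous value.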


6.4.3 Updating μ


Next we have to update $\mu$. Here we face the following dilemma: if we decrease
$\mu$ too quickly, we will get bad convergence of our second order method, since
the solution to the problem (which depends on the value of $\mu$) moves too quickly
away from our current set of parameters $(x, \alpha, \xi)$. On the other hand, we do not
want to spend too much time solving an approximation of the unrelaxed ($\mu = 0$)
KKT conditions exactly. A good indication is how much the positivity constraints
would be violated by the current update. Vanderbei [556] proposes the following
update of $\mu$:

$$
\mu = \frac{\langle \alpha, \xi \rangle}{n} \left( \frac{1 - \lambda}{10 + \lambda} \right)^2.
\tag{6.85}
$$

The first term gives the average value of satisfaction of the condition $\alpha_i \xi_i = \mu$
after an update step. The second term allows us to decrease $\mu$ rapidly if good
progress was made (small $(1 - \lambda)^2$). Experimental evidence shows that it pays to
be slightly more conservative, and to use the predictor estimates of $\alpha, \xi$ for (6.85)
rather than the corresponding corrector terms.⁸ This imposes little overhead for
the implementation.


6.4.4 Initial Conditions and Stopping Criterion 


To provide a complete algorithm, we have to consider two more things: a stopping
criterion and a suitable start value. For the latter, we simply solve a regularized
version of the initial reduced KKT system (6.83). This means that we replace $K$ by
$K + \mathbf{1}$, use $(x, \alpha)$ in place of $(\Delta x, \Delta\alpha)$, and replace $D$ by the identity matrix. Moreover,
$\rho_p$ and $\rho_d$ are set to the values they would have if all variables had been set to 0
before, and finally $\rho_{\mathrm{KKT}}$ is set to 0. In other words, we obtain an initial guess of
$(x, \alpha, \xi)$ by solving

$$
\begin{bmatrix} K + \mathbf{1} & A^\top \\ A & -\mathbf{1} \end{bmatrix}
\begin{bmatrix} x \\ \alpha \end{bmatrix}
=
\begin{bmatrix} -c \\ -d \end{bmatrix}
\tag{6.86}
$$

and $\xi = -Ax - d$. Since we have to ensure positivity of $\alpha, \xi$, we simply replace

$$
\alpha_i = \max(\alpha_i, 1) \quad \text{and} \quad \xi_i = \max(\xi_i, 1). \tag{6.87}
$$

This heuristic solves the problem of a suitable initial condition.


8. In practice it is often useful to replace $(1 - \lambda)$ by $(1 + \varepsilon - \lambda)$ for some small $\varepsilon > 0$, in order
to avoid $\mu = 0$.

Regarding the stopping criterion, we recall Theorem 6.27, and in particular
(6.62). Rather than obtaining bounds on the precision of parameters, we want
to make sure that $f(x)$ is close to its optimal value $f(\bar{x})$. From (6.64) we know,
provided the feasibility constraints are all satisfied, that the value of the dual
objective function is given by $f(x) + \sum_i \alpha_i c_i(x)$. We may use the latter to bound
the relative size of the gap between primal and dual objective function by

$$
\mathrm{Gap}(x, \alpha)
= \frac{2 \left| f(x) - \left( f(x) + \sum_i \alpha_i c_i(x) \right) \right|}{|f(x)| + \left| f(x) + \sum_i \alpha_i c_i(x) \right|}
\le \frac{-\sum_{i=1}^n \alpha_i c_i(x)}{\left| f(x) + \tfrac{1}{2} \sum_i \alpha_i c_i(x) \right|}.
\tag{6.88}
$$

For the special case where $f(x) = \tfrac{1}{2} x^\top K x + c^\top x$ as in (6.72), we know by virtue of
(6.73) that the size of the feasibility gap is given by $\alpha^\top \xi$, since $c_i(x) = -\xi_i$, and therefore

$$
\mathrm{Gap}(x, \alpha) = \frac{\alpha^\top \xi}{\left| \tfrac{1}{2} x^\top K x + c^\top x - \tfrac{1}{2} \alpha^\top \xi \right|}.
\tag{6.89}
$$

In practice, a small number is usually added to the denominator of (6.89) in order
to avoid divisions by 0 in the first iteration. The quality of the solution is typically
measured on a logarithmic scale by $-\log_{10} \mathrm{Gap}(x, \alpha)$, the number of significant
figures.⁹ We will come back to specific versions of such interior point algorithms in
Chapter 10, and show how Support Vector Regression and Classification problems
can be solved with them.
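
The pieces of Sections 6.4.1 to 6.4.4 can be combined into a compact sketch. The following NumPy code is a minimal illustration under stated assumptions, not a reference implementation: the function name `interior_point_qp`, the starting value of $\mu$, the small constant in the denominator of the gap, and the use of the post-step values of $\alpha, \xi$ in the update of $\mu$ (rather than predictor estimates) are choices made here for brevity. The gap uses the midpoint between primal and dual objective, $f(x) - \tfrac{1}{2}\alpha^\top \xi$, in the denominator.

```python
import numpy as np

def interior_point_qp(K, c, A, d, eps=0.05, tol=1e-9, max_iter=50):
    """Minimize 0.5 x'Kx + c'x subject to Ax + d <= 0 (sketch of Section 6.4)."""
    n, m = A.shape
    # Initial guess: regularized reduced KKT system (6.86), then clipping (6.87).
    M0 = np.block([[K + np.eye(m), A.T], [A, -np.eye(n)]])
    sol = np.linalg.solve(M0, np.concatenate([-c, -d]))
    x, alpha = sol[:m], np.maximum(sol[m:], 1.0)
    xi = np.maximum(-A @ x - d, 1.0)
    mu = 0.01 * (alpha @ xi) / n   # assumed starting value; the text leaves it open

    for _ in range(max_iter):
        rho_p = -K @ x - A.T @ alpha - c
        rho_d = -A @ x - d - xi
        primal = 0.5 * x @ K @ x + c @ x
        gap = (alpha @ xi) / (abs(primal - 0.5 * alpha @ xi) + 1e-12)
        if gap < tol and max(np.abs(rho_p).max(), np.abs(rho_d).max()) < tol:
            break
        M = np.block([[K, A.T], [A, -np.diag(xi / alpha)]])   # matrix of (6.83)

        def solve(rho_kkt):
            step = np.linalg.solve(M, np.concatenate([rho_p, rho_d - rho_kkt]))
            dx, da = step[:m], step[m:]
            return dx, da, rho_kkt - (xi / alpha) * da        # d_xi from (6.82)

        # Predictor: drop the second-order term. Corrector: reinsert it.
        _, da, dxi = solve(mu / alpha - xi)
        dx, da, dxi = solve(mu / alpha - xi - da * dxi / alpha)

        # Step length from (6.84): keep a relative distance eps to the boundary.
        ratios = np.concatenate([da / alpha, dxi / xi])
        lam = 1.0 / max(1.0, ratios.min() / (eps - 1.0))
        x, alpha, xi = x + lam * dx, alpha + lam * da, xi + lam * dxi
        # Update of mu with the (1 + eps - lam) variant that avoids mu = 0.
        mu = (alpha @ xi) / n * ((1.0 + eps - lam) / (10.0 + lam)) ** 2
    return x, alpha, xi

# Toy problem: minimize 0.5*||x||^2 - x1 - x2 subject to x1 + x2 - 1 <= 0.
K = np.eye(2)
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
d = np.array([-1.0])
x, alpha, xi = interior_point_qp(K, c, A, d)
print(x, alpha, xi)
```

On this toy problem the iterates approach $x = (0.5, 0.5)$, $\alpha = 0.5$, $\xi = 0$, with $\xi$ shrinking by roughly the factor $\varepsilon$ per iteration once the step length bound becomes active.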

Primal-dual path following methods are certainly not the only algorithms that
can be employed for minimizing constrained quadratic problems. Other variants,
for instance, are Barrier Methods [282, 45, 557], which minimize the unconstrained
problem

$$
f(x) - \mu \sum_{i=1}^{n} \ln(-c_i(x)) \quad \text{for } \mu > 0. \tag{6.90}
$$

Active set methods have also been used with success in machine learning [369,
284]. These select subsets of variables $x$ for which the constraints $c_i$ are not active,
i.e., where we have a strict inequality, and solve the resulting restricted
quadratic program, for instance by conjugate gradient descent. We will encounter
subset selection methods in Chapter 10.


9. Interior point codes are very precise. They usually achieve up to 8 significant figures,
whereas iterative approximation methods do not normally exceed 3 significant
figures on large optimization problems.


6.5 Maximum Search Problems 


Approximations 


In several cases the task of finding an optimal function for estimation purposes 
means finding the best element from a finite set, or sometimes finding an optimal 
subset from a finite set of elements. These are discrete (sometimes combinatorial) 
optimization problems which are not so easily amenable to the techniques pre- 
sented in the previous two sections. Furthermore, many commonly encountered 
problems are computationally expensive if solved exactly. Instead, by using prob- 
abilistic methods, it is possible to find almost optimal approximate solutions. These 
probabilistic methods are the topic of the present section. 


6.5.1 Random Subset Selection 


Consider the following problem: given a set of $m$ functions, say $M := \{f_1, \ldots, f_m\}$,
and some criterion $Q[f]$, find the function $\hat{f}$ that maximizes $Q[f]$. More formally,

$$
\hat{f} := \operatorname*{argmax}_{f \in M} Q[f]. \tag{6.91}
$$

Clearly, unless we have additional knowledge about the values $Q[f_i]$, we have
to compute all terms $Q[f_i]$ if we want to solve (6.91) exactly. This will cost $O(m)$
operations. If $m$ is large, which is often the case in practical applications, this
operation is too expensive. In sparse greedy approximation problems (Section
10.2) or in Kernel Feature Analysis (Section 14.4), $m$ can easily be of the order of
$10^5$ or larger (here, $m$ is the number of training patterns). Hence we have to look
for cheaper approximate solutions.

The key idea is to pick a random subset $M' \subset M$ that is sufficiently large,
and take the maximum over $M'$ as an approximation of the maximum over $M$.
Provided the distribution of the values of $Q[f_i]$ is "well behaved", i.e., there is
no small fraction of $Q[f_i]$ whose values are significantly smaller or larger than
the average, we will obtain a solution that is close to the optimum with high
probability. To formalize these ideas, we need the following result.


Lemma 6.31 (Maximum of Random Variables) Denote by $\xi, \xi'$ two independent
random variables on $\mathbb{R}$ with corresponding distributions $P_\xi, P_{\xi'}$ and distribution
functions $F_\xi, F_{\xi'}$. Then the random variable $\bar{\xi} := \max(\xi, \xi')$ has the distribution function
$F_{\bar{\xi}} = F_\xi F_{\xi'}$.


Proof Note that for a random variable, the distribution function $F(\xi_0)$ is given by
the probability $P\{\xi \le \xi_0\}$. Since $\xi$ and $\xi'$ are independent, we may write

$$
F_{\bar{\xi}}(\xi_0) = P\{\max(\xi, \xi') \le \xi_0\} = P\{\xi \le \xi_0 \text{ and } \xi' \le \xi_0\}
= P\{\xi \le \xi_0\}\, P\{\xi' \le \xi_0\} = F_\xi(\xi_0)\, F_{\xi'}(\xi_0),
\tag{6.92}
$$

which proves the claim. ∎


Repeated application of Lemma 6.31 leads to the following corollary. 


Corollary 6.32 (Maximum Over Identical Random Variables) Let $\xi_1, \ldots, \xi_{\tilde{m}}$ be $\tilde{m}$
independent and identically distributed (iid) random variables, with corresponding
distribution function $F_\xi$. Then the random variable $\bar{\xi} := \max(\xi_1, \ldots, \xi_{\tilde{m}})$ has the distribution
function $F_{\bar{\xi}}(\xi) = (F_\xi(\xi))^{\tilde{m}}$.


In practice, the random variables $\xi_i$ will be the values of $Q[f_i]$, where the $f_i$ are
drawn from the set $M$. If we draw them without replacement (i.e. none of the functions
$f_i$ appears twice), however, the values after each draw are dependent and we
cannot apply Corollary 6.32 directly. Nonetheless, we can see that the maximum
over draws without replacement will be larger than the maximum with replacement,
since recurring observations can be understood as reducing the effective
size of the set to be considered. Thus Corollary 6.32 gives us a lower bound on the
value of the distribution function for draws without replacement. Moreover, for
large $m$ the difference between draws with and without replacement is small.
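
Corollary 6.32 is easy to check numerically for iid draws. The following sketch uses only the Python standard library; the helper name `empirical_cdf_of_max`, the choice of the uniform distribution on $[0, 1]$ (for which $F_\xi(t) = t$), and the sample sizes are assumptions made for illustration.

```python
import random

def empirical_cdf_of_max(m_sub, t, trials=20000, seed=0):
    """Monte Carlo estimate of P{max(xi_1, ..., xi_m_sub) <= t} for iid
    uniform draws on [0, 1]."""
    rng = random.Random(seed)
    hits = sum(
        max(rng.random() for _ in range(m_sub)) <= t
        for _ in range(trials)
    )
    return hits / trials

m_sub, t = 5, 0.8
estimate = empirical_cdf_of_max(m_sub, t)
exact = t ** m_sub          # (F(t))^m_sub as predicted by Corollary 6.32
print(estimate, exact)
```

The estimate should agree with $0.8^5 \approx 0.328$ up to Monte Carlo error.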

If the distribution of $Q[f_i]$ is known, we may use the distribution directly to
determine the size $\tilde{m}$ of a subset to be used to find some $Q[f_i]$ that is almost as
good as the solution to (6.91). In all other cases, we have to resort to assessing the
relative quality of maxima over subsets. The following theorem tells us how.


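Theorem 6.33 (Ranks on Random Subsets) Denote by $M := \{x_1, \ldots, x_m\} \subset \mathbb{R}$ a set
of cardinality $m$, and by $\tilde{M} \subset M$ a random subset of size $\tilde{m}$. Then the probability that
$\max \tilde{M}$ is greater than or equal to $n$ elements of $M$ is at least $1 - \left(\frac{n}{m}\right)^{\tilde{m}}$.


Proof We prove this by assuming the converse, namely that $\max \tilde{M}$ is smaller
than $(m - n)$ elements of $M$. For $\tilde{m} = 1$ we know that this probability is $\frac{n}{m}$, since
there are $n$ elements to choose from. For $\tilde{m} > 1$, the probability is the one of
choosing $\tilde{m}$ elements out of a subset $M_{\mathrm{low}}$ of $n$ elements, rather than all $m$ elements.
Therefore we have that

$$
P(\tilde{M} \subset M_{\mathrm{low}}) = \binom{n}{\tilde{m}} \Big/ \binom{m}{\tilde{m}}
= \frac{n}{m} \cdot \frac{n - 1}{m - 1} \cdots \frac{n - \tilde{m} + 1}{m - \tilde{m} + 1}
\le \left( \frac{n}{m} \right)^{\tilde{m}}.
$$

Consequently the probability that the maximum over $\tilde{M}$ will be larger than $n$
elements of $M$ is given by $1 - P(\tilde{M} \subset M_{\mathrm{low}}) \ge 1 - \left(\frac{n}{m}\right)^{\tilde{m}}$. ∎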


The practical consequence is that we may use $1 - \left(\frac{n}{m}\right)^{\tilde{m}}$ to compute the required
size of a random subset to achieve the desired degree of approximation. If we
want to obtain results in the $\frac{n}{m}$ percentile range with $1 - \eta$ confidence, we must
solve for $\tilde{m} = \frac{\ln \eta}{\ln (n/m)}$. To give a numerical example, if we desire values that
are better than 95% of all other estimates with $1 - 0.05$ probability, then $\tilde{m} = 59$
samples are sufficient. This (95%, 95%, 59) rule is very useful in practice.¹⁰ A
similar method was used to speed up the process of boosting classifiers in the
MadaBoost algorithm [143]. Furthermore, one could ask whether it might not
be useful to recycle old observations rather than computing all 59 values from
scratch. If this can be done cheaply, and under some additional independence
assumptions, subset selection methods can be improved further. For details see
[424] who use the method in the context of memory management for operating
systems.
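
The (95%, 95%, 59) rule follows directly from rounding $\ln \eta / \ln(n/m)$ up to the next integer. A short standard-library sketch (the function name `subset_size` is an illustrative assumption):

```python
import math

def subset_size(quantile, eta):
    """Smallest subset size s with 1 - quantile**s >= 1 - eta, i.e. the
    bound of Theorem 6.33 solved for the subset size."""
    return math.ceil(math.log(eta) / math.log(quantile))

size = subset_size(quantile=0.95, eta=0.05)
print(size, 1 - 0.95 ** size)
```

With 59 samples the guaranteed probability is just above 0.95, whereas 58 samples fall slightly short.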


6.5.2 Random Evaluation 


Quite often, the evaluation of the term Q[f] itself is rather time consuming, es- 
pecially if Q[f] is the sum of many (m, for instance) iid random variables. Again, 
we can speed up matters considerably by using probabilistic methods. The key 
idea is that averages over independent random variables are concentrated, which 
is to say that averages over subsets do not differ too much from averages over the 
whole set. 

Hoeffding’s Theorem (Section 5.2) quantifies the size of the deviations between 
the expectation of a sum of random variables and their values at individual trials. 
We will use this to bound deviations between averages over sets and subsets. All 
we have to do is translate Theorem 5.1 into a statement regarding sample averages 
over different sample sizes. This can be readily constructed as follows: 


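Corollary 6.34 (Deviation Bounds for Empirical Means [508]) Suppose $\xi_1, \ldots, \xi_m$
are iid bounded random variables, falling into the interval $[a, a + b]$ with probability one.
Denote their average by $Q_m = \frac{1}{m} \sum_i \xi_i$. Furthermore, denote by $\xi_{s(1)}, \ldots, \xi_{s(\tilde{m})}$ with $\tilde{m} < m$
a subset of the same random variables (with $s : \{1, \ldots, \tilde{m}\} \to \{1, \ldots, m\}$ being an injective
map, i.e. $s(i) = s(j)$ only if $i = j$), and $Q_{\tilde{m}} = \frac{1}{\tilde{m}} \sum_i \xi_{s(i)}$. Then for any $\varepsilon > 0$,

$$
\left.
\begin{aligned}
P\{Q_m - Q_{\tilde{m}} \ge \varepsilon\} \\
P\{Q_{\tilde{m}} - Q_m \ge \varepsilon\}
\end{aligned}
\right\}
\le \exp\left( -\frac{2 m \tilde{m} \varepsilon^2}{(m - \tilde{m})\, b^2} \right)
= \exp\left( -\frac{2 \tilde{m} \varepsilon^2}{b^2} \cdot \frac{1}{1 - \frac{\tilde{m}}{m}} \right).
\tag{6.93}
$$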


Proof By construction $\mathbf{E}[Q_m - Q_{\tilde{m}}] = 0$, since $Q_m$ and $Q_{\tilde{m}}$ are both averages over
sums of random variables drawn from the same distribution. Hence we only have
to rewrite $Q_m - Q_{\tilde{m}}$ as an average over (different) random variables to apply
Hoeffding's bound. Since all $\xi_i$ are identically distributed, we may pick the first
$\tilde{m}$ random variables, without loss of generality. In other words, we assume that
$s(i) = i$ for $i = 1, \ldots, \tilde{m}$. Then

$$
Q_m - Q_{\tilde{m}} = \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{\tilde{m}} \sum_{i=1}^{\tilde{m}} \xi_i
= \frac{1}{m} \sum_{i=1}^{\tilde{m}} \left( 1 - \frac{m}{\tilde{m}} \right) \xi_i + \frac{1}{m} \sum_{i=\tilde{m}+1}^{m} \xi_i.
\tag{6.94}
$$

Thus we may split up $Q_m - Q_{\tilde{m}}$ into a sum of $\tilde{m}$ random variables with range
$b_i = \left(\frac{m}{\tilde{m}} - 1\right) b$, and $m - \tilde{m}$ random variables with range $b_i = b$. We obtain

$$
\sum_{i=1}^{m} b_i^2 = b^2 \tilde{m} \left( \frac{m}{\tilde{m}} - 1 \right)^2 + (m - \tilde{m})\, b^2 = b^2 (m - \tilde{m}) \frac{m}{\tilde{m}}.
\tag{6.95}
$$

Substituting this into (5.7) and noting that $Q_m - Q_{\tilde{m}} - \mathbf{E}[Q_m - Q_{\tilde{m}}] = Q_m - Q_{\tilde{m}}$
completes the proof. ∎


10. During World War I tanks were often numbered in continuous increasing order. Unfortunately
this "feature" allowed the enemy to estimate the number of tanks. How?
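For small $\frac{\tilde{m}}{m}$ the rhs in (6.93) reduces to $\exp\left( -\frac{2 \tilde{m} \varepsilon^2}{b^2} \right)$. In other words, deviations on
the subsample $\tilde{m}$ dominate the overall deviation of $Q_m - Q_{\tilde{m}}$ from 0. This allows
us to compute a cutoff criterion for evaluating $Q_m$ by computing only a subset of
its terms.

We need only solve (6.93) for $\frac{\tilde{m}}{m}$. Hence, in order to ensure that $Q_{\tilde{m}}$ is within $\varepsilon$ of
$Q_m$ with probability $1 - \eta$, we have to take a fraction $\frac{\tilde{m}}{m}$ of samples that satisfies

$$
\frac{\frac{\tilde{m}}{m}}{1 - \frac{\tilde{m}}{m}} = \frac{b^2 (\ln 2 - \ln \eta)}{2 m \varepsilon^2} =: c,
\quad \text{and therefore} \quad \frac{\tilde{m}}{m} = \frac{c}{1 + c}.
\tag{6.96}
$$

The fraction $\frac{\tilde{m}}{m}$ can be small for large $m$, which is exactly the case where we need
methods to speed up evaluation.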
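
To get a feeling for (6.96), the fraction can be evaluated for concrete numbers. The following standard-library sketch assumes the form $c = b^2(\ln 2 - \ln \eta) / (2 m \varepsilon^2)$; the function name `sample_fraction` and the parameter values are illustrative assumptions.

```python
import math

def sample_fraction(m, b, eps, eta):
    """Fraction of the m terms needed so that the subset average is within
    eps of the full average with probability 1 - eta, following (6.96)."""
    c = b ** 2 * (math.log(2.0) - math.log(eta)) / (2.0 * m * eps ** 2)
    return c / (1.0 + c)

m = 100_000
frac = sample_fraction(m, b=1.0, eps=0.05, eta=0.05)
m_sub = math.ceil(frac * m)
print(frac, m_sub)
```

For variables with range $b = 1$, accuracy $\varepsilon = 0.05$ and confidence 95%, fewer than 1% of the 100,000 terms (733 of them) need to be evaluated.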
6.5.3 Greedy Optimization Strategies 


Quite often the overall goal is not necessarily to find the single best element $x_i$ from
a set $X$ to solve a problem, but to find a good subset $\tilde{X} \subset X$ of size $\tilde{m}$ according to
some quality criterion $Q[\tilde{X}]$. Problems of this type include approximating a matrix
by a subset of its rows and columns (Section 10.2), finding approximate solutions
to Kernel Fisher Discriminant Analysis (Chapter 15) and finding a sparse solution
to the problem of Gaussian Process Regression (Section 16.3.4). These all have a
common structure:


(i) Finding an optimal set $\tilde{X} \subset X$ is quite often a combinatorial problem, or it even
may be NP-hard, since it means selecting $\tilde{m} = |\tilde{X}|$ elements from a set of $m = |X|$
elements. There are $\binom{m}{\tilde{m}}$ different choices, which clearly prevents an exhaustive
search over all of them. Additionally, the size of $\tilde{X}$ is often not known beforehand.
Hence we need a fast approximate algorithm.

(ii) The evaluation of $Q[\tilde{X} \cup \{x_i\}]$ is inexpensive, provided $Q[\tilde{X}]$ has been computed
before. This indicates that an iterative algorithm can be useful.

(iii) The value of $Q[X]$, or equivalently how well we would do by taking the
whole set $X$, can be bounded efficiently by using $Q[\tilde{X}]$ (or some by-products of
the computation of $Q[\tilde{X}]$) without actually computing $Q[X]$.


(iv) The set of functions $X$ is typically very large (i.e. more than $10^5$ elements),
yet the individual improvements by $x_i$ via $Q[\tilde{X} \cup \{x_i\}]$ do not differ too much,
meaning that specific $x_i$ for which $Q[\tilde{X} \cup \{x_i\}]$ deviate by a large amount from the
rest of $Q[\tilde{X} \cup \{x_i\}]$ do not exist.


In this case we may use a sparse greedy algorithm to find near optimal solutions
among the remaining $X \setminus \tilde{X}$ elements. This combines the idea of an iterative
enlargement of $\tilde{X}$ by one more element at a time (which is feasible since we can
compute $Q[\tilde{X} \cup \{x_i\}]$ cheaply) with the idea that we need not consider all $x_i$ as
possible candidates for the enlargement. This uses the reasoning in Section 6.5.1
combined with the fact that the distribution of the improvements is not too long
tailed (cf. (iv)). The overall strategy is described in Algorithm 6.6.


Algorithm 6.6 Sparse Greedy Algorithm
Require: Set of functions X, Precision ε, Criterion Q[·]
  Set X̃ = ∅
  repeat
    Choose random subset X' of size m' from X \ X̃.
    Pick x̂ = argmax_{x ∈ X'} Q[X̃ ∪ {x}]
    X̃ = X̃ ∪ {x̂}
    If needed, (re)compute bound on Q[X].
  until Q[X̃] + ε ≥ Bound on Q[X]
Output: X̃, Q[X̃]


Problems 6.9 and 6.10 contain more examples of sparse greedy algorithms.


6.6 Summary


This chapter gave an overview of different optimization methods, which form the
basic toolbox for solving the problems arising in learning with kernels. The main 
focus was on convex and differentiable problems, hence the overview of properties 
of convex sets and functions defined on them. 

The key insights in Section 6.1 are that convex sets can be defined by level sets of 
convex functions and that convex optimization problems have one global minimum. 
Furthermore, the fact that the solutions of convex maximization over polyhedral 
sets can be found on the vertices will prove useful in some unsupervised learning 
applications (Section 14.4). 

Basic tools for unconstrained problems (Section 6.2) include interval cut- 
ting methods, the Newton method, Conjugate Gradient descent, and Predictor- 
Corrector methods. These techniques are often used as building blocks to solve 
more advanced constrained optimization problems. 

Since constrained minimization is a fairly complex topic, we only presented a 
selection of fundamental results, such as necessary and sufficient conditions in 
the general case of nonlinear programming. The KKT conditions for differentiable 
convex functions then followed immediately from the previous reasoning. The
main results are dualization, meaning the transformation of optimization prob- 
lems via the Lagrangian mechanism into possibly simpler problems, and that op- 
timality properties can be estimated via the KKT gap (Theorem 6.27). 

Interior point algorithms are practical applications of the duality reasoning; 
these seek to find a solution to optimization problems by satisfying the KKT opti- 
mality conditions. Here we were able to employ some of the concepts introduced 
at an earlier stage, such as predictor corrector methods and numerical ways of 
finding roots of equations. These algorithms are robust tools to find solutions 
on moderately sized problems ($10^3$ to $10^4$ examples). Larger problems require
decomposition methods, to be discussed in Section 10.4, or randomized methods.
The chapter concluded with an overview of randomized methods for maximiz- 
ing functions or finding the best subset of elements. These techniques are useful 
once datasets are so large that we cannot reasonably hope to find exact solutions 
to optimization problems. 


6.7 Problems 


6.1 (Level Sets •) Given the function $f : \mathbb{R}^2 \to \mathbb{R}$ with $f(x) := |x_1|^p + |x_2|^p$, for which $p$
do we obtain a convex function?

Now consider the sets $\{x \mid f(x) \le c\}$ for some $c > 0$. Can you give an explicit
parametrization of the boundary of the set? Is it easier to deal with this parametrization?
Can you find other examples (see also [489] and Chapter 8 for details)?


6.2 (Convex Hulls •) Show that for any set $X$, its convex hull $\operatorname{co} X$ is convex. Furthermore,
show that $\operatorname{co} X = X$ if $X$ is convex.


6.3 (Method of False Position [334] •••) Given a unimodal (possessing one minimum)
differentiable function $f : \mathbb{R} \to \mathbb{R}$, develop a quadratic method for minimizing $f$.

Hint: Recall the Newton method. There we used $f''(x)$ to make a quadratic approximation
of $f$. Two values of $f'(x)$ are also sufficient to obtain this information, however.

What happens if we may only use $f$? What does the iteration scheme look like? See
Figure 6.8 for a hint.


6.4 (Convex Minimization in one Variable ••) Denote by $f$ a convex function on
$[a, b]$. Show that the algorithm below finds the minimum of $f$. What is the rate of
convergence in $x$ to $\operatorname{argmin}_x f(x)$? Can you obtain a bound in $f(x)$ wrt. $\min_x f(x)$?


input a, b, f and threshold ε
x1 = a, x2 = (a + b)/2, x3 = b and compute f(x1), f(x2), f(x3)
repeat
  if x3 − x2 > x2 − x1 then
    x4 = (x2 + x3)/2 and compute f(x4)
  else
    x4 = (x1 + x2)/2 and compute f(x4)
  end if
  Keep the two points closest to the point with the minimum value of f(xi) and rename
  them such that x1 < x2 < x3.
until x3 − x1 < ε


6.5 (Newton Method in $\mathbb{R}^d$ ••) Extend the Newton method to functions on $\mathbb{R}^d$. What
does the iteration rule look like? Under which conditions does the algorithm converge? Do
you have to extend Theorem 6.13 to prove convergence?


6.6 (Rewriting Quadratic Functionals •) Given a function

$$
f(x) = x^\top Q x + c^\top x + d, \tag{6.97}
$$

rewrite it into the form of (6.18). Give explicit expressions for $x^* = \operatorname{argmin}_x f(x)$ and the
difference in the additive constants.


6.7 (Kantorovich Inequality [278] •••) Prove Theorem 6.16. Hint: note that without
loss of generality we may require $\|x\|^2 = 1$. Second, perform a transformation of coordinates
into the eigensystem of $K$. Finally, note that in the new coordinate system we are
dealing with convex combinations of eigenvalues $\lambda_i$ and $\lambda_i^{-1}$. First show (6.24) for only two
eigenvalues. Then argue that only the largest and smallest eigenvalues matter.


6.8 (Random Subsets •) Generate $m$ random numbers drawn uniformly from the interval
$[0, 1]$. Plot their distribution function. Plot the distribution of maxima of subsets of
random numbers. What can you say about the distribution of the maxima? What happens
if you draw randomly from the Laplace distribution, with density $p(\xi) = e^{-\xi}$ (for $\xi \ge 0$)?


6.9 (Matching Pursuit [342] ••) Denote by $f_1, \ldots, f_m$ a set of functions $X \to \mathbb{R}$, by
$\{x_1, \ldots, x_m\} \subset X$ a set of locations and by $\{y_1, \ldots, y_m\} \subset Y$ a set of corresponding
observations.

Design a sparse greedy algorithm that finds a linear combination of functions $f :=
\sum_i \alpha_i f_i$ minimizing the squared loss between $f(x_i)$ and $y_i$.

Figure 6.8 From left to right: Newton method, method of false position, quadratic interpolation
through 3 points. Solid line: $f(x)$, dash-dotted line: interpolation.

6.10 (Reduced Set Approximation [474] ••) Let $f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x)$ be a kernel expansion
in a Reproducing Kernel Hilbert Space $\mathcal{H}_k$ (see Section 2.2.3). Give a sparse greedy
algorithm that finds an approximation to $f$ in $\mathcal{H}_k$ by using fewer terms. See also Chapter
18 for more detail.


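6.11 (Equality Constraints in LP and QP ••) Find the dual optimization problem and
the necessary KKT conditions for the following optimization problem:

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && c^\top x, \\
&\text{subject to} && Ax + b \le 0, \\
& && Cx + d = 0,
\end{aligned}
\tag{6.98}
$$

where $c, x \in \mathbb{R}^m$, $b \in \mathbb{R}^n$, $d \in \mathbb{R}^{n'}$, $A \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{n' \times m}$. Hint: split up the equality
constraints into two inequality constraints. Note that you may combine the two Lagrange
multipliers again to obtain a free variable. Derive the corresponding conditions for

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && \tfrac{1}{2} x^\top K x + c^\top x, \\
&\text{subject to} && Ax + b \le 0, \\
& && Cx + d = 0,
\end{aligned}
\tag{6.99}
$$

where $K$ is a strictly positive definite matrix.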


6.12 (Not Strictly Definite Quadratic Parts •••) How do you have to change the dual
of (6.99) if $K$ does not have full rank? Is it better not to dualize in this case? Do the KKT
conditions still hold?


6.13 (Dual Problems of Quadratic Programs ••) Denote by $P$ a quadratic optimization
problem of type (6.72) and by $(\cdot)^D$ the dualization operation. Prove that the following
is true,

$$
((P^D)^D)^D = P^D \quad \text{and} \quad (((P^D)^D)^D)^D = (P^D)^D, \tag{6.100}
$$

where in general $(P^D)^D \ne P$. Hint: use (6.80). Caution: you have to check whether $KA^\top$
has full rank.


6.14 (Interior Point Equations for Linear Programs [336] •••) Derive the interior
point equations for linear programs. Hint: use the expansions for the quadratic programs
and note that the reduced KKT system has only a diagonal term where we had $K$ before.
How does the complexity of the problem scale with the size of $A$?


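6.15 (Update Step in Interior Point Codes •) Show that the maximum value of $\lambda$ satisfying
(6.84) can be found by

$$
\lambda^{-1} = \max\left( 1,\ (\varepsilon - 1)^{-1} \min_{i \in [n]} \frac{\Delta\alpha_i}{\alpha_i},\ (\varepsilon - 1)^{-1} \min_{i \in [n]} \frac{\Delta\xi_i}{\xi_i} \right).
\tag{6.101}
$$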


II SUPPORT VECTOR MACHINES 


The algorithms for constructing the separating hyperplane considered above will be utilized 
for developing a battery of programs for pattern recognition. 
V. N. Vapnik [560, p. 364] 


Now that we have the necessary concepts and tools, we move on to the class of 
Support Vector (SV) algorithms. SV algorithms are commonly considered the first 
practicable spin-off of statistical learning theory. We described the basic ideas of 
Support Vector machines (SVMs) in Chapter 1. It is now time for a much more 
detailed discussion and description of SVMs, starting with the case of pattern 
recognition (Chapter 7), which was historically the first to be developed. 

Following this, we move on to a problem that can actually be considered as 
being even simpler than pattern recognition. In pattern recognition, we try to 
distinguish between patterns of at least two classes; in single-class classification 
(Chapter 8), however, there is only one class. In the latter case, which belongs to 
the realm of unsupervised learning, we try to learn a model of the data which 
describes, in a weak sense, what the training data looks like. This model can then 
be used to assess the “typicality” or novelty of previously unseen patterns, a task 
which is rather useful in a number of application domains. 

Chapter 9 introduces SV algorithms for regression estimation. These retain most 
of the properties of the other SV algorithms, with the exception that in the regres- 
sion case, the choice of the loss function, as described in Chapter 3, becomes a 
more interesting issue. 

After this, we give details on how to implement the various types of SV
algorithms (Chapter 10), and we describe some methods for incorporating prior 
knowledge about invariances of a given problem into SVMs (Chapter 11). 

We conclude this part of the book by revisiting statistical learning theory, this 
time with a much stronger emphasis on elements that are specific to SVMs and 
kernel methods (Chapter 12). 


7 Pattern Recognition


Overview


Prerequisites


This chapter is devoted to a detailed description of SV classification (SVC) meth- 
ods. We have already briefly visited the SVC algorithm in Chapter 1. There will be 
some overlap with that chapter, but here we give a more thorough treatment. 

We start by describing the classifier that forms the basis for SVC, the separating 
hyperplane (Section 7.1). Separating hyperplanes can differ in how large a margin 
of separation they induce between the classes, with corresponding consequences 
on the generalization error, as discussed in Section 7.2. The “optimal” margin hy- 
perplane is defined in Section 7.3, along with a description of how to compute it. 
Using the kernel trick of Chapter 2, we generalize to the case where the optimal 
margin hyperplane is not computed in input space, but in a feature space nonlin- 
early related to the latter (Section 7.4). This dramatically increases the applicability 
of the approach, as does the introduction of slack variables to deal with outliers 
and noise in the data (Section 7.5). Many practical problems require us to classify 
the data into more than just two classes. Section 7.6 describes how multi-class SV 
classification systems can be built. Following this, Section 7.7 describes some vari- 
ations on standard SV classification algorithms, differing in the regularizers and 
constraints that are used. We conclude with a fairly detailed section on experi- 
ments and applications (Section 7.8). 

This chapter requires basic knowledge of kernels, as conveyed in the first half 
of Chapter 2. To understand details of the optimization problems, it is helpful (but 
not indispensable) to get some background from Chapter 6. To understand the 
connections to learning theory, in particular regarding the statistical basis of the 
regularizer used in SV classification, it would be useful to have read Chapter 5. 


7.1 Separating Hyperplanes 


Hyperplane 


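Suppose we are given a dot product space $\mathcal{H}$, and a set of pattern vectors
$x_1, \ldots, x_m \in \mathcal{H}$. Any hyperplane in $\mathcal{H}$ can be written as


$$\{x \in \mathcal{H} \mid \langle w, x\rangle + b = 0\}, \quad w \in \mathcal{H},\ b \in \mathbb{R}. \tag{7.1}$$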


In this formulation, w is a vector orthogonal to the hyperplane: If w has unit
length, then $\langle w, x\rangle$ is the length of x along the direction of w (Figure 7.1). For
general w, this number will be scaled by $\|w\|$. In any case, the set (7.1) consists
of vectors that all have the same length along w. In other words, these are vectors
that project onto the same point on the line spanned by w.

In this formulation, we still have the freedom to multiply w and b by the same 
non-zero constant. This superfluous freedom — physicists would call it a “gauge” 
freedom — can be abolished as follows. 


Definition 7.1 (Canonical Hyperplane) The pair $(w, b) \in \mathcal{H} \times \mathbb{R}$ is called a canonical
form of the hyperplane (7.1) with respect to $x_1, \ldots, x_m \in \mathcal{H}$, if it is scaled such that


$$\min_{i=1,\ldots,m} |\langle w, x_i\rangle + b| = 1, \tag{7.2}$$


which amounts to saying that the point closest to the hyperplane has a distance of $1/\|w\|$
(Figure 7.2).
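
The rescaling in Definition 7.1 can be sketched in a few lines. The following is a minimal illustration, assuming a finite-dimensional Euclidean space and plain Python lists; the helper names `dot` and `canonical_form` and the toy data are hypothetical and not part of the text.

```python
from math import sqrt


def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def canonical_form(w, b, points):
    """Rescale (w, b) so that min_i |<w, x_i> + b| = 1, cf. (7.2)."""
    scale = min(abs(dot(w, x) + b) for x in points)
    if scale == 0:
        raise ValueError("a point lies on the hyperplane; no canonical form exists")
    return [wi / scale for wi in w], b / scale


# Hypothetical toy data: three points and the hyperplane 2*x1 + 2*x2 - 1 = 0.
points = [(2.0, 0.0), (0.0, 3.0), (-1.0, -1.0)]
w, b = canonical_form([2.0, 2.0], -1.0, points)

# After rescaling, the closest point satisfies |<w, x> + b| = 1 ...
closest = min(abs(dot(w, x) + b) for x in points)
# ... and its Euclidean distance to the hyperplane equals 1 / ||w||.
distance = closest / sqrt(dot(w, w))
```

Rescaling changes the parameters but not the hyperplane itself, so `distance` is the same number one would obtain from the unscaled pair.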


Note that the condition (7.2) still allows two such pairs: given a canonical
hyperplane $(w, b)$, another one satisfying (7.2) is given by $(-w, -b)$. For the purpose of
pattern recognition, these two hyperplanes turn out to be different, as they are
oriented differently; they correspond to two decision functions,


$$f_{w,b}: \mathcal{H} \to \{\pm 1\}, \qquad x \mapsto f_{w,b}(x) = \operatorname{sgn}(\langle w, x\rangle + b), \tag{7.3}$$


which are the inverse of each other (they assign opposite labels to every point off the
hyperplane).

In the absence of class labels $y_i \in \{\pm 1\}$ associated with the $x_i$, there is no way
of distinguishing the two hyperplanes. For a labelled dataset, a distinction exists:
The two hyperplanes make opposite class assignments. In pattern recognition,
we attempt to find a solution $f_{w,b}$ which correctly classifies the labelled examples
$(x_i, y_i) \in \mathcal{H} \times \{\pm 1\}$; in other words, which satisfies $f_{w,b}(x_i) = y_i$ for all i (in this
case, the training set is said to be separable), or at least for a large fraction thereof.


Figure 7.1 A separable classification problem, along with a separating hyperplane, written
in terms of an orthogonal weight vector w and a threshold b. Note that by multiplying both
w and b by the same non-zero constant, we obtain the same hyperplane, represented in
terms of different parameters. Figure 7.2 shows how to eliminate this scaling freedom.


Figure 7.2 By requiring the scaling of w and b to be such that the point(s) closest to the
hyperplane satisfy $|\langle w, x_i\rangle + b| = 1$, we obtain a canonical form (w, b) of a hyperplane. Note
that in this case, the margin, measured perpendicularly to the hyperplane, equals $1/\|w\|$.
This can be seen by considering two opposite points which precisely satisfy $|\langle w, x_i\rangle + b| = 1$
(cf. Problem 7.4). Figure annotation: from $\langle w, x_1\rangle + b = +1$ and $\langle w, x_2\rangle + b = -1$ it follows
that $\langle w, x_1 - x_2\rangle = 2$, and hence $\langle w/\|w\|, x_1 - x_2\rangle = 2/\|w\|$.

The next section will introduce the term margin, to denote the distance to a sep- 
arating hyperplane from the point closest to it. It will be argued that to generalize 
well, a large margin should be sought. In view of Figure 7.2, this can be achieved 
by keeping ||w|| small. Readers who are content with this level of detail may skip 
the next section and proceed directly to Section 7.3, where we describe how to 
construct the hyperplane with the largest margin. 


7.2 The Role of the Margin 


Geometrical 
Margin 


Margin of 
Canonical 
Hyperplanes 


Insensitivity to 
Pattern Noise 


The margin plays a crucial role in the design of SV learning algorithms. Let us start 
by formally defining it. 


Definition 7.2 (Geometrical Margin) For a hyperplane $\{x \in \mathcal{H} \mid \langle w, x\rangle + b = 0\}$, we
call


$$\rho_{(w,b)}(x, y) := y(\langle w, x\rangle + b)/\|w\| \tag{7.4}$$


the geometrical margin of the point $(x, y) \in \mathcal{H} \times \{\pm 1\}$. The minimum value


$$\rho_{(w,b)} := \min_{i=1,\ldots,m} \rho_{(w,b)}(x_i, y_i) \tag{7.5}$$


shall be called the geometrical margin of $(x_1, y_1), \ldots, (x_m, y_m)$. If the latter is omitted, it
is understood that the training set is meant.


Occasionally, we will omit the qualification geometrical, and simply refer to the 
margin. 

For a point (x, y) which is correctly classified, the margin is simply the distance
from x to the hyperplane. To see this, note first that the margin is zero on the
hyperplane. Second, in the definition, we effectively consider a hyperplane


$$(\hat{w}, \hat{b}) := (w/\|w\|, b/\|w\|), \tag{7.6}$$


which has a unit length weight vector, and then compute the quantity $y(\langle \hat{w}, x\rangle + \hat{b})$.
The term $\langle \hat{w}, x\rangle$, however, simply computes the length of the projection of x onto
the direction orthogonal to the hyperplane, which, after adding the offset $\hat{b}$, equals
the distance to it. The multiplication by y ensures that the margin is positive
whenever a point is correctly classified. For misclassified points, we thus get a
margin which equals the negative distance to the hyperplane. Finally, note that
for canonical hyperplanes, the margin is $1/\|w\|$ (Figure 7.2). The definition of
the canonical hyperplane thus ensures that the length of w now corresponds to
a meaningful geometrical quantity.
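
As a small numerical illustration of (7.4) and (7.5), the sketch below computes the geometrical margin of individual points and of a labelled set. It assumes Euclidean vectors stored as tuples; the function names and the toy data are hypothetical.

```python
from math import sqrt


def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def geometrical_margin(w, b, x, y):
    """Margin of a single labelled point, cf. (7.4)."""
    return y * (dot(w, x) + b) / sqrt(dot(w, w))


def margin_of_set(w, b, examples):
    """Margin of a labelled set: the minimum over its points, cf. (7.5)."""
    return min(geometrical_margin(w, b, x, y) for x, y in examples)


# Hypothetical toy data; (w, b) is in canonical form with respect to these points.
examples = [((2.0, 0.0), +1), ((0.0, 3.0), +1), ((-1.0, -1.0), -1)]
w, b = [2.0 / 3.0, 2.0 / 3.0], -1.0 / 3.0

rho = margin_of_set(w, b, examples)                          # equals 1 / ||w||
flipped = geometrical_margin(w, b, (2.0, 0.0), -1)           # wrong label: negative margin
```

Because (w, b) is canonical here, `rho` coincides with 1/||w||; flipping a label turns the margin into the negative distance, as described above.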

It turns out that the margin of a separating hyperplane, and thus the length of 
the weight vector w, plays a fundamental role in support vector type algorithms. 
Loosely speaking, if we manage to separate the training data with a large margin, 
then we have reason to believe that we will do well on the test set. Not surprisingly, 
there exist a number of explanations for this intuition, ranging from the simple to 
the rather technical. We will now briefly sketch some of them. 

Figure 7.3 Two-dimensional toy example of a classification problem: Separate 'o' from '+'
using a hyperplane. Suppose that we add bounded noise to each pattern. If the optimal
margin hyperplane has margin $\rho$, and the noise is bounded by $r < \rho$, then the hyperplane
will correctly separate even the noisy patterns. Conversely, if we ran the perceptron
algorithm (which finds some separating hyperplane, but not necessarily the optimal one)
on the noisy data, then we would recover the optimal hyperplane in the limit $r \to \rho$.


The simplest possible justification for large margins is as follows. Since the
training and test data are assumed to have been generated by the same underlying
dependence, it seems reasonable to assume that most of the test patterns will lie
close (in $\mathcal{H}$) to at least one of the training patterns. For the sake of simplicity, let us
consider the case where all test points are generated by adding bounded pattern
noise (sometimes called input noise) to the training patterns. More precisely, given
a training point (x, y), we will generate test points of the form $(x + \Delta x, y)$, where
$\Delta x \in \mathcal{H}$ is bounded in norm by some r > 0. Clearly, if we manage to separate the
training set with a margin $\rho > r$, we will correctly classify all test points: Since all
training points have a distance of at least $\rho$ to the hyperplane, the test patterns will
still be on the correct side (Figure 7.3, cf. also [152]).
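
The step from a margin larger than r to correct classification of all noisy test points can be spelled out with the Cauchy-Schwarz inequality. The following short derivation uses the notation of Definition 7.2 and assumes only that the noise satisfies $\|\Delta x\| \le r$ and that the training point (x, y) has geometrical margin at least $\rho > r$:

```latex
\begin{align*}
\rho_{(w,b)}(x + \Delta x, y)
  &= \frac{y(\langle w, x\rangle + b)}{\|w\|} + \frac{y\langle w, \Delta x\rangle}{\|w\|} \\
  &\ge \rho - \frac{|\langle w, \Delta x\rangle|}{\|w\|}
   \ge \rho - \|\Delta x\| \ge \rho - r > 0 .
\end{align*}
```

A positive geometrical margin means that $x + \Delta x$ lies on the side of the hyperplane indicated by its label y.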

If we knew $\rho$ beforehand, then this could actually be turned into an optimal
margin classifier training algorithm, as follows. If we use an r which is slightly
smaller than $\rho$, then even the patterns with added noise will be separable with a
nonzero margin. In this case, the standard perceptron algorithm can be shown to
converge.¹

Therefore, we can run the perceptron algorithm on the noisy patterns. If the
algorithm finds a sufficient number of noisy versions of each pattern, with different
perturbations $\Delta x$, then the resulting hyperplane will not intersect any of the balls
depicted in Figure 7.3. As r approaches $\rho$, the resulting hyperplane should better
approximate the maximum margin solution (the figure depicts the limit $r = \rho$).
This constitutes a connection between training with pattern noise and maximizing
the margin. The latter, in turn, can be thought of as a regularizer, comparable to
those discussed earlier (see Chapter 4 and (2.49)). Similar connections to training
with noise, for other types of regularizers, have been pointed out before for neural
networks [50].


1. Rosenblatt's perceptron algorithm [439] is one of the simplest conceivable iterative
procedures for computing a separating hyperplane. In its simplest form, it proceeds as
follows. We start with an arbitrary weight vector $w_0$. At step $n \in \mathbb{N}$, we consider the
training example $(x_n, y_n)$. If it is classified correctly using the current weight vector (i.e., if
$\operatorname{sgn}\langle x_n, w_{n-1}\rangle = y_n$), we set $w_n := w_{n-1}$; otherwise, we set $w_n := w_{n-1} + \eta y_n x_n$ (here, $\eta > 0$
is a learning rate). We thus loop over all patterns repeatedly, until we can complete one full
pass through the training set without a single error. The resulting weight vector will thus
classify all points correctly. Novikoff [386] proved that this procedure terminates, provided
that the training set is separable with a nonzero margin.
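
The perceptron procedure described in this footnote can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes a hyperplane through the origin (no offset b, as in the footnote), starts from the zero vector as one admissible choice of the initial weights, treats points lying exactly on the hyperplane as errors, and adds a hypothetical `max_epochs` guard for non-separable data.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def perceptron(examples, eta=1.0, max_epochs=1000):
    """Loop over the data until one full pass produces no classification error."""
    w = [0.0] * len(examples[0][0])          # assumed starting point w_0 = 0
    for _ in range(max_epochs):
        errors = 0
        for x, y in examples:
            if y * dot(w, x) <= 0:           # misclassified, or on the hyperplane
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            return w
    raise RuntimeError("no separating hyperplane found within max_epochs")


# Hypothetical toy data, separable by a hyperplane through the origin.
examples = [((2.0, 1.0), +1), ((1.0, 3.0), +1), ((-1.0, -2.0), -1), ((-3.0, -1.0), -1)]
w = perceptron(examples)
```

The returned vector separates the data, but nothing in the update rule makes it the maximum margin solution; that is the gap the noise argument above is meant to close.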

Figure 7.4 Two-dimensional toy example of a classification problem: Separate 'o' from '+'
using a hyperplane passing through the origin. Suppose the patterns are bounded in length
(distance to the origin) by R, and the classes are separated by an optimal hyperplane
(parametrized by the angle $\gamma$) with margin $\rho$. In this case, we can perturb the parameter
by some $\Delta\gamma$ with $|\Delta\gamma| < \arcsin(\rho/R)$, and still correctly separate the data.


A similar robustness argument can be made for the dependence of the
hyperplane on the parameters (w, b) (cf. [504]). If all points lie at a distance of at least
$\rho$ from the hyperplane, and the patterns are bounded in length, then small
perturbations to the hyperplane parameters will not change the classification of the
training data (see Figure 7.4).² Being able to perturb the parameters of the
hyperplane amounts to saying that to store the hyperplane, we need fewer bits than
we would for a hyperplane whose exact parameter settings are crucial.
Interestingly, this is related to what is called the Minimum Description Length principle
([583, 433, 485], cf. also [522, 305, 94]): The best description of the data, in terms of
generalization error, should be the one that requires the fewest bits to store.

We now move on to a more technical justification of large margin algorithms.
For simplicity, we only deal with hyperplanes that have offset b = 0, leaving
$f(x) = \operatorname{sgn}\langle w, x\rangle$. The theorem below follows from a result in [24].


Theorem 7.3 (Margin Error Bound) Consider the set of decision functions
$f(x) = \operatorname{sgn}\langle w, x\rangle$ with $\|w\| \le \Lambda$ and $\|x\| \le R$, for some $R, \Lambda > 0$. Moreover, let $\rho > 0$, and
$\nu$ denote the fraction of training examples with margin smaller than $\rho/\|w\|$, referred to as
the margin error.

For all distributions P generating the data, with probability at least $1 - \delta$ over the
drawing of the m training patterns, and for any $\rho > 0$ and $\delta \in (0, 1)$, the probability
that a test pattern drawn from P will be misclassified is bounded from above, by


$$\nu + \sqrt{\frac{c}{m}\left(\frac{R^2\Lambda^2}{\rho^2}\ln^2 m + \ln(1/\delta)\right)}. \tag{7.7}$$


Here, c is a universal constant.


2. Note that this would not hold true if we allowed patterns of arbitrary length — this type 
of restriction of the pattern lengths pops up in various places, such as Novikoff’s theorem 
[386], Vapnik’s VC dimension bound for margin classifiers (Theorem 5.5), and Theorem 7.3. 

Let us try to understand this theorem. It makes a probabilistic statement about a
probability, by giving an upper bound on the probability of test error, which itself
only holds true with a certain probability, $1 - \delta$. Where do these two probabilities
come from? The first is due to the fact that the test examples are randomly drawn
from P; the second is due to the training examples being drawn from P. Strictly
speaking, the bound does not refer to a single classifier that has been trained on
some fixed data set at hand, but to an ensemble of classifiers, trained on various
instantiations of training sets generated by the same underlying regularity P.

It is beyond the scope of the present chapter to prove this result. The basic
ingredients of bounds of this type, commonly referred to as VC bounds, are described
in Chapter 5; for further details, see Chapter 12, and [562, 491, 504, 125]. Several
aspects of the bound are noteworthy. The test error is bounded by a sum of the
margin error $\nu$, and a capacity term (the $\sqrt{\cdots}$ term in (7.7)), with the latter
tending to zero as the number of examples, m, tends to infinity. The capacity term can
be kept small by keeping R and $\Lambda$ small, and making $\rho$ large. If we assume that
R and $\Lambda$ are fixed a priori, the main influence is $\rho$. As can be seen from (7.7), a
large $\rho$ leads to a small capacity term, but the margin error $\nu$ gets larger. A small
$\rho$, on the other hand, will usually cause fewer points to have margins smaller than
$\rho/\|w\|$, leading to a smaller margin error; but the capacity penalty will increase
correspondingly. The overall message: Try to find a hyperplane which is aligned
such that even for a large $\rho$, there are few margin errors.

Maximizing $\rho$, however, is the same as minimizing the length of w. Hence we
might just as well keep $\rho$ fixed, say, equal to 1 (which is the case for canonical
hyperplanes), and search for a hyperplane which has a small $\|w\|$ and few points
with a margin smaller than $1/\|w\|$; in other words (Definition 7.2), few points such
that $y\langle w, x\rangle < 1$.

It should be emphasized that dropping the condition $\|w\| \le \Lambda$ would prevent 
us from stating a bound of the kind shown above. We could give an alternative 
bound, where the capacity depends on the dimensionality of the space H. The 
crucial advantage of the bound given above is that it is independent of that 
dimensionality, enabling us to work in very high dimensional spaces. This will 
become important when we make use of the kernel trick. 

It has recently been pointed out that the margin also plays a crucial role in im- 
proving asymptotic rates in nonparametric estimation [551]. This topic, however, 
is beyond the scope of the present book. 

To conclude this section, we note that large margin classifiers also have advan- 
tages of a practical nature: An algorithm that can separate a dataset with a certain 
margin will behave in a benign way when implemented in hardware. Real-world 
systems typically work only within certain accuracy bounds, and if the classifier 
is insensitive to small changes in the inputs, it will usually tolerate those inaccura- 
cies. 

We have thus accumulated a fair amount of evidence in favor of the following 
approach: Keep the margin training error small, and the margin large, in order to 
achieve high generalization ability. In other words, hyperplane decision functions 
should be constructed such that they maximize the margin, and at the same time 
separate the training data with as few exceptions as possible. Sections 7.3 and 7.5 
respectively will deal with these two issues. 


7.3 Optimal Margin Hyperplanes 


Lagrangian 


Let us now derive the optimization problem to be solved for computing the
optimal hyperplane. Suppose we are given a set of examples $(x_1, y_1), \ldots, (x_m, y_m)$,
$x_i \in \mathcal{H}$, $y_i \in \{\pm 1\}$. Here and below, the index i runs over $1, \ldots, m$ by default. We assume
that there is at least one negative and one positive $y_i$. We want to find a decision
function $f_{w,b}(x) = \operatorname{sgn}(\langle w, x\rangle + b)$ satisfying


$$f_{w,b}(x_i) = y_i. \tag{7.8}$$


If such a function exists (the non-separable case will be dealt with later),
canonicality (7.2) implies


$$y_i(\langle x_i, w\rangle + b) \ge 1. \tag{7.9}$$


As an aside, note that out of the two canonical forms of the same hyperplane, $(w, b)$
and $(-w, -b)$, only one will satisfy equations (7.8) and (7.11). The existence of class
labels thus allows us to distinguish between the two orientations of a hyperplane.

Following the previous section, a separating hyperplane which generalizes well 
can thus be constructed by solving the following problem: 


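$$\underset{w \in \mathcal{H},\, b \in \mathbb{R}}{\text{minimize}} \quad \tau(w) = \frac{1}{2}\|w\|^2, \tag{7.10}$$

$$\text{subject to} \quad y_i(\langle x_i, w\rangle + b) \ge 1 \ \text{ for all } i = 1, \ldots, m. \tag{7.11}$$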


This is called the primal optimization problem. 

Problems like this one are the subject of optimization theory. For details on how 
to solve them, see Chapter 6; for a short intuitive explanation, cf. the remarks 
following (1.26) in the introductory chapter. We will now derive the so-called dual 
problem, which can be shown to have the same solutions as (7.10). In the present 
case, it will turn out that it is more convenient to deal with the dual. To derive it, 
we introduce the Lagrangian, 


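$$L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \bigl(y_i(\langle x_i, w\rangle + b) - 1\bigr), \tag{7.12}$$


with Lagrange multipliers $\alpha_i \ge 0$. Recall that as in Chapter 1, we use bold face
Greek variables to refer to the corresponding vectors of variables, for instance,
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$.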

The Lagrangian L must be maximized with respect to $\alpha_i$, and minimized with
respect to w and b (see Theorem 6.26). Consequently, at this saddle point, the
derivatives of L with respect to the primal variables must vanish,


$$\frac{\partial}{\partial b} L(w, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \boldsymbol{\alpha}) = 0, \tag{7.13}$$

which leads to

$$\sum_{i=1}^{m} \alpha_i y_i = 0, \tag{7.14}$$

and

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i. \tag{7.15}$$

The solution vector thus has an expansion in terms of training examples. Note
that although the solution w is unique (due to the strict convexity of (7.10), and
the convexity of (7.11)), the coefficients $\alpha_i$ need not be.
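
For readers who want the intermediate step, the two conditions follow by differentiating the Lagrangian (7.12) term by term. The short derivation below uses only the facts that the gradient of $\frac{1}{2}\|w\|^2$ with respect to w is w, and that the gradient of $\langle x_i, w\rangle$ with respect to w is $x_i$:

```latex
\begin{align*}
\frac{\partial L}{\partial b}
  &= -\sum_{i=1}^{m} \alpha_i y_i = 0
  &&\Longrightarrow\quad \sum_{i=1}^{m} \alpha_i y_i = 0, \\
\frac{\partial L}{\partial w}
  &= w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0
  &&\Longrightarrow\quad w = \sum_{i=1}^{m} \alpha_i y_i x_i .
\end{align*}
```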

According to the KKT theorem (Chapter 6), the Lagrange multipliers $\alpha_i$ can be
non-zero at the saddle point only for constraints (7.11) which are
precisely met. Formally, for all $i = 1, \ldots, m$, we have


$$\alpha_i [y_i(\langle x_i, w\rangle + b) - 1] = 0. \tag{7.16}$$


The patterns $x_i$ for which $\alpha_i > 0$ are called Support Vectors. This terminology is
related to corresponding terms in the theory of convex sets, relevant to convex
optimization (e.g., [334, 45]).³ According to (7.16), they lie exactly on the margin.⁴
All remaining examples in the training set are irrelevant: Their constraints (7.11)
are satisfied automatically, and they do not appear in the expansion (7.15), since
their multipliers satisfy $\alpha_i = 0$.⁵

This leads directly to an upper bound on the generalization error of optimal
margin hyperplanes. To this end, we consider the so-called leave-one-out method
(for further details, see Section 12.2) to estimate the expected test error [335, 559].
This procedure is based on the idea that if we leave out one of the training
examples, and train on the remaining ones, then the probability of error on the
left-out example gives us a fair indication of the true test error. Of course, doing this for
a single training example leads to an error of either zero or one, so it does not yet
give an estimate of the test error. The leave-one-out method repeats this procedure
for each individual training example in turn, and averages the resulting errors.


3. Given any boundary point of a convex set, there always exists a hyperplane separating
the point from the interior of the set. This is called a supporting hyperplane.

SVs lie on the boundary of the convex hulls of the two classes, thus they possess
supporting hyperplanes. The SV optimal hyperplane is the hyperplane which lies in the middle of
the two parallel supporting hyperplanes (of the two classes) with maximum distance.

Conversely, from the optimal hyperplane, we can obtain supporting hyperplanes for all
SVs of both classes, by shifting it by $1/\|w\|$ in both directions.

4. Note that this implies the solution (w, b), where b is computed using $y_i(\langle w, x_i\rangle + b) = 1$ for
SVs, is in canonical form with respect to the training data. (This makes use of the reasonable
assumption that the training set contains both positive and negative examples.)

5. In a statistical mechanics framework, Anlauf and Biehl [12] have put forward a similar
argument for the optimal stability perceptron, also computed using constrained optimization.
There is a large body of work in the physics community on optimal margin classification.
Some further references of interest are [310, 191, 192, 394, 449, 141]; other early works
include [313].

Let us return to the present case. If we leave out a pattern $x_{i^*}$, and construct
the solution from the remaining patterns, the following outcomes are possible (cf.
(7.11)):


1. $y_{i^*}(\langle x_{i^*}, w\rangle + b) > 1$. In this case, the pattern is classified correctly and does not
lie on the margin. These are patterns that would not have become SVs anyway.


2. $y_{i^*}(\langle x_{i^*}, w\rangle + b) = 1$. In other words, $x_{i^*}$ exactly meets the constraint (7.11). In
this case, the solution w does not change, even though the coefficients $\alpha_i$ would
change if $x_{i^*}$ had become a Support Vector (i.e., $\alpha_{i^*} > 0$) when kept in the
training set. In that case, the fact that the solution is the
same, no matter whether $x_{i^*}$ is in the training set or not, means that $x_{i^*}$ can be
written as $\sum_{\text{SVs}} \beta_i y_i x_i$ with $\beta_i \ge 0$. Note that condition 2 is not equivalent to saying
that $x_{i^*}$ may be written as some linear combination of the remaining Support
Vectors: Since the sign of the coefficients in the linear combination is determined
by the class of the respective pattern, not any linear combination will do. Strictly
speaking, $x_{i^*}$ must lie in the cone spanned by the $y_i x_i$, where the $x_i$ are all Support
Vectors. For more detail, see [565] and Section 12.2.


3. $0 < y_{i^*}(\langle x_{i^*}, w\rangle + b) < 1$. In this case, $x_{i^*}$ lies within the margin, but still on the
correct side of the decision boundary. Thus, the solution looks different from the
one obtained with $x_{i^*}$ in the training set (in that case, $x_{i^*}$ would satisfy (7.11) after
training); classification is nevertheless correct.


4. $y_{i^*}(\langle x_{i^*}, w\rangle + b) < 0$. This means that $x_{i^*}$ is classified incorrectly.


Note that cases 3 and 4 necessarily correspond to examples which would have
become SVs if kept in the training set; case 2 potentially includes such situations.
Only case 4, however, leads to an error in the leave-one-out procedure.
Consequently, we have the following result on the generalization error of optimal
margin classifiers [570]:⁷


Proposition 7.4 The expectation of the number of Support Vectors obtained during
training on a training set of size m, divided by m, is an upper bound on the expected
probability of test error of the SVM trained on training sets of size $m - 1$.⁸


6. Possible non-uniqueness of the solution's expansion in terms of SVs is related to zero
Eigenvalues of $(y_i y_j k(x_i, x_j))_{ij}$, cf. Proposition 2.16. Note, however, the above caveat on the
distinction between linear combinations, and linear combinations with coefficients of fixed
sign.

7. It also holds for the generalized versions of optimal margin classifiers described in the
following sections.

8. Note that the leave-one-out procedure performed with m training examples thus yields 

Figure 7.5 The optimal hyperplane (Figure 7.2) is the one bisecting the shortest connection
between the convex hulls of the two classes.


A sharper bound can be formulated by making a further distinction in case 2, 
between SVs that must occur in the solution, and those that can be expressed in 
terms of the other SVs (see [570, 565, 268, 549] and Section 12.2). 

We now return to the optimization problem to be solved. Substituting the con- 
ditions for the extremum, (7.14) and (7.15), into the Lagrangian (7.12), we arrive at 
the dual form of the optimization problem: 


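$$\underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{maximize}} \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle, \tag{7.17}$$

$$\text{subject to} \quad \alpha_i \ge 0, \quad i = 1, \ldots, m, \tag{7.18}$$

$$\text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{7.19}$$
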
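
The substitution that produces the dual objective can be written out term by term. The derivation below inserts the expansion $w = \sum_i \alpha_i y_i x_i$ from (7.15) into the Lagrangian (7.12) and uses the constraint $\sum_i \alpha_i y_i = 0$ from (7.14) to eliminate b:

```latex
\begin{align*}
\tfrac{1}{2}\|w\|^2
  &= \tfrac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle, \\
\sum_{i=1}^{m} \alpha_i y_i \langle x_i, w\rangle
  &= \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle, \\
b \sum_{i=1}^{m} \alpha_i y_i &= 0, \\
\text{hence}\quad L
  &= \sum_{i=1}^{m} \alpha_i
     - \tfrac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
   = W(\boldsymbol{\alpha}).
\end{align*}
```

The primal variables w and b no longer appear, and the patterns enter only through their pairwise dot products.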
On substitution of the expansion (7.15) into the decision function (7.3), we obtain 
an expression which can be evaluated in terms of dot products, taken between the 
pattern to be classified and the Support Vectors, 


$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i \langle x, x_i\rangle + b\right). \tag{7.20}$$
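
To make the dual quantities concrete, the sketch below works through a two-point toy problem whose dual can be solved by hand. With $x_1 = (1, 1)$, $y_1 = +1$ and $x_2 = (-1, -1)$, $y_2 = -1$, the equality constraint forces both multipliers to a common value a, the dual objective becomes $2a - 4a^2$, and its maximizer is a = 1/4. The data, the helper names, and the choice to recover b from the condition $y_s(\langle w, x_s\rangle + b) = 1$ at a Support Vector are illustrative assumptions, not a general solver.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


X = [(1.0, 1.0), (-1.0, -1.0)]   # hypothetical training patterns
y = [+1, -1]
alpha = [0.25, 0.25]             # hand-computed maximizer of the dual for this toy set


def dual_objective(a):
    """W(alpha) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>."""
    quad = sum(a[i] * a[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(len(X)) for j in range(len(X)))
    return sum(a) - 0.5 * quad


def expansion(x):
    """sum_i alpha_i y_i <x, x_i>, i.e. <w, x> without ever forming w."""
    return sum(alpha[i] * y[i] * dot(x, X[i]) for i in range(len(X)))


# Threshold from a Support Vector s: y_s (<w, x_s> + b) = 1, so b = y_s - <w, x_s>.
b = y[0] - expansion(X[0])


def decide(x):
    """Decision function evaluated through dot products only."""
    return 1 if expansion(x) + b >= 0 else -1
```

Here both training points are Support Vectors, and `decide` never needs the weight vector explicitly, which is what later allows the dot product to be replaced by a kernel.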


To conclude this section, we note that there is an alternative way to derive the
dual optimization problem [38]. To describe it, we first form the convex hulls $C_+$


a bound valid for training sets of size m — 1. This difference, however, does not usually 
mislead us too much. In statistical terms, the leave-one-out error is called almost unbiased. 
Note, moreover, that the statement talks about the expected probability of test error — there 
are thus two sources of randomness. One is the expectation over different training sets of 
size m — 1, the other is the probability of test error when one of the SVMs is faced with a test 
example drawn from the underlying distribution generating the data. For a generalization, 
see Theorem 12.9. 
and $C_-$ of both classes of training points,


$$C_\pm := \left\{ \sum_{y_i = \pm 1} c_i x_i \;\middle|\; \sum_{y_i = \pm 1} c_i = 1,\ c_i \ge 0 \right\}. \tag{7.21}$$


It can be shown that the maximum margin hyperplane as described above is the
one bisecting the shortest line orthogonally connecting $C_+$ and $C_-$ (Figure 7.5).
Formally, this can be seen by considering the optimization problem


$$\underset{c \in \mathbb{R}^m}{\text{minimize}} \quad \left\| \sum_{y_i = 1} c_i x_i - \sum_{y_i = -1} c_i x_i \right\|^2,$$

$$\text{subject to} \quad \sum_{y_i = 1} c_i = 1, \quad \sum_{y_i = -1} c_i = 1, \quad c_i \ge 0, \tag{7.22}$$


and using the normal vector $w = \sum_{y_i = 1} c_i x_i - \sum_{y_i = -1} c_i x_i$, scaled to satisfy the
canonicality condition (Definition 7.1). The threshold b is explicitly adjusted such that the
hyperplane bisects the shortest connecting line (see also Problem 7.7).


Cover's Theorem


Thus far, we have shown why it is that a large margin hyperplane is good from a
statistical point of view, and we have demonstrated how to compute it. Although
these two points have worked out nicely, there is still a major drawback to the
approach: Everything that we have done so far is linear in the data. To allow
for much more general decision surfaces, we now use kernels to nonlinearly
transform the input data $x_1, \ldots, x_m \in \mathcal{X}$ into a high-dimensional feature space,
using a map $\Phi: x_i \mapsto \mathbf{x}_i := \Phi(x_i)$; we then do a linear separation there.

To justify this procedure, Cover's Theorem [113] is sometimes alluded to. This
theorem characterizes the number of possible linear separations of m points in
general position in an N-dimensional space. If $m \le N + 1$, then all $2^m$ separations
are possible; the VC dimension of the function class is N + 1 (Section 5.5.6). If
$m > N + 1$, then Cover's Theorem states that the number of linear separations
equals


$$2\sum_{i=0}^{N} \binom{m-1}{i}. \tag{7.23}$$


The more we increase N, the more terms there are in the sum, and thus the larger 
is the resulting number. This theorem formalizes the intuition that the number of 
separations increases with the dimensionality. It requires, however, that the points 
are in general position — therefore, it does not strictly make a statement about 
the separability of a given dataset in a given feature space. E.g., the feature map 
might be such that all points lie on a rather restrictive lower-dimensional manifold, 
which could prevent us from finding points in general position. 
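
The count in Cover's Theorem is easy to tabulate. The sketch below assumes the reading of (7.23) as $2\sum_{i=0}^{N}\binom{m-1}{i}$ for points in general position, together with the $2^m$ case for $m \le N + 1$; the function name is hypothetical.

```python
from math import comb


def num_linear_separations(m, N):
    """Number of linear separations of m points in general position in R^N."""
    if m <= N + 1:
        return 2 ** m                     # every labelling is realizable
    return 2 * sum(comb(m - 1, i) for i in range(N + 1))


# For a fixed number of points, the count grows with the dimension N
# until it saturates at 2**m.
counts = [num_linear_separations(10, N) for N in range(1, 10)]
fractions = [c / 2 ** 10 for c in counts]
```

For example, four points in general position in the plane admit 14 of the 16 possible labellings; the two that fail are the XOR-type labellings.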

There is another way to intuitively understand why the kernel mapping
increases the chances of a separation, in terms of concepts of statistical learning
theory. Using a kernel typically amounts to using a larger function class, thus
increasing the capacity of the learning machine, and rendering problems separable
that are not linearly separable to start with.


Figure 7.6 By mapping the input data (top left) nonlinearly (via $\Phi$) into a
higher-dimensional feature space $\mathcal{H}$ (here: $\mathcal{H} = \mathbb{R}^3$), and constructing a separating
hyperplane there (bottom left), an SVM (top right) corresponds to a nonlinear decision
surface in input space (here: $\mathbb{R}^2$, bottom right). We use $x_1, x_2$ to denote the entries of the
input vectors, and $w_1, w_2, w_3$ to denote the entries of the hyperplane normal vector in $\mathcal{H}$.
[Panel labels: input space $\mathbb{R}^2$; map $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$; feature space with axes $x_1^2$, $x_2^2$,
$\sqrt{2}x_1x_2$; decision function $f(x) = \operatorname{sgn}(w_1 x_1^2 + w_2 x_2^2 + w_3\sqrt{2}x_1x_2 + b)$.]


On the practical level, the modifications necessary to perform the algorithm
in a high-dimensional feature space are minor. In the above sections, we made
no assumptions on the dimensionality of $\mathcal{H}$, the space in which we assumed
our patterns belong. We only required $\mathcal{H}$ to be equipped with a dot product.
The patterns that we talked about previously thus need not coincide with
the input patterns. They can equally well be the results $\Phi(x_i)$ of mapping the original
input patterns $x_i$ into a high-dimensional feature space. Consequently, we take
the stance that wherever we wrote x, we actually meant $\Phi(x)$. Maximizing the
target function (7.17), and evaluating the decision function (7.20), then requires
the computation of dot products $\langle \Phi(x), \Phi(x_i)\rangle$ in a high-dimensional space. These
expensive calculations are reduced significantly by using a positive definite kernel
k (see Chapter 2), such that


$$\langle \Phi(x), \Phi(x_i)\rangle = k(x, x_i), \tag{7.24}$$

leading to decision functions of the form (cf. (7.20))

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b\right). \tag{7.25}$$
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Figure 7.7 Architecture of SVMs. The kernel function k is chosen a priori; it determines the 
type of classifier (for instance, polynomial classifier, radial basis function classifier, or neural 
network). All other parameters (number of hidden units, weights, threshold b) are found 
during training, by solving a quadratic programming problem. The first layer weights x; 
are a subset of the training set (the Support Vectors); the second layer weights A; = yi&i are 
computed from the Lagrange multipliers (cf. (7.25)). 
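
The identity (7.24) can be checked numerically on a small case. The following is a minimal sketch, assuming the degree-2 feature map $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$ on $\mathbb{R}^2$ and the matching kernel $k(x, z) = \langle x, z \rangle^2$; the function names `phi`, `dot`, and `poly2_kernel` are illustrative and not taken from the text.

```python
import math

def phi(x):
    """Explicit degree-2 feature map from R^2 to R^3 (assumed for this sketch)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def dot(u, v):
    """Euclidean dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, evaluated in input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(z))  # dot product computed explicitly in feature space
rhs = poly2_kernel(x, z)   # same quantity obtained through the kernel, cf. (7.24)
print(lhs, rhs)
```

The kernel evaluation needs one dot product in the input space, whereas the explicit route first builds both feature vectors; for higher degrees and input dimensions this difference is what makes the kernel formulation practical.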


At this point, a small aside regarding terminology is in order. As explained in
Chapter 2, the input domain $\mathcal{X}$ need not be a vector space. Therefore, the Support
Vectors in (7.25) (i.e., those $x_i$ with $\alpha_i > 0$) are not necessarily vectors. One could
choose to be on the safe side, and only refer to the corresponding $\Phi(x_i)$ as SVs.
Common usage employs the term in a somewhat loose sense for both, however.

Consequently, everything that has been said about the linear case also applies
to nonlinear cases, obtained using a suitable kernel $k$, instead of the Euclidean dot
product (Figure 7.6). By using some of the kernel functions described in Chapter 2,
the SV algorithm can construct a variety of learning machines (Figure 7.7), some
of which coincide with classical architectures: polynomial classifiers of degree $d$,

$$k(x, x_i) = \langle x, x_i \rangle^d, \tag{7.26}$$

radial basis function classifiers with Gaussian kernel of width $c > 0$,

$$k(x, x_i) = \exp\left( -\|x - x_i\|^2 / c \right), \tag{7.27}$$

and neural networks (e.g., [49, 235]) with tanh activation function,

$$k(x, x_i) = \tanh\left( \kappa \langle x, x_i \rangle + \Theta \right). \tag{7.28}$$


The parameters $\kappa > 0$ and $\Theta \in \mathbb{R}$ are the gain and horizontal shift. As we shall
see later, the tanh kernel can lead to very good results. Nevertheless, we should
mention at this point that from a mathematical point of view, it has certain
shortcomings, cf. the discussion following (2.69).
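
As a rough illustration of how the kernels (7.26) to (7.28) plug into a decision function of the form (7.25), consider the sketch below. The support vectors, labels, coefficients, and threshold are made-up values chosen for illustration (they are not the output of any training run), and the convention of returning +1 when the argument of sgn is exactly 0 is an assumption of this sketch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, d=2):
    """Polynomial kernel of degree d, cf. (7.26)."""
    return dot(x, z) ** d

def gauss_kernel(x, z, c=1.0):
    """Gaussian RBF kernel of width c > 0, cf. (7.27)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)

def tanh_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid kernel with gain kappa and shift theta, cf. (7.28)."""
    return math.tanh(kappa * dot(x, z) + theta)

def decide(x, svs, ys, alphas, b, kernel):
    """Evaluate f(x) = sgn(sum_i y_i alpha_i k(x, x_i) + b), cf. (7.25)."""
    g = sum(y * a * kernel(x, sv) for sv, y, a in zip(svs, ys, alphas)) + b
    return 1 if g >= 0 else -1  # sgn(0) := +1 is a convention of this sketch

# Hypothetical expansion: one SV per class, equal weights, zero threshold.
svs = [(0.0, 0.0), (2.0, 2.0)]
ys = [1, -1]
alphas = [1.0, 1.0]
label_near_first = decide((0.1, 0.1), svs, ys, alphas, 0.0, gauss_kernel)
label_near_second = decide((1.9, 1.9), svs, ys, alphas, 0.0, gauss_kernel)
print(label_near_first, label_near_second)
```

Swapping `gauss_kernel` for one of the other two functions changes the type of classifier without touching `decide`, which mirrors the point that the kernel is the only architectural choice made a priori.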
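To find the decision function (7.25), we solve the following problem (cf. (7.17)):

$$\underset{\alpha \in \mathbb{R}^m}{\text{maximize}} \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \tag{7.29}$$

subject to the constraints (7.18) and (7.19).

If $k$ is positive definite, $Q_{ij} := (y_i y_j k(x_i, x_j))_{ij}$ is a positive definite matrix (Problem 7.6),
which provides us with a convex problem that can be solved efficiently
(cf. Chapter 6). To see this, note that (cf. Proposition 2.16)

$$\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) = \left\langle \sum_{i=1}^{m} \alpha_i y_i \Phi(x_i), \sum_{j=1}^{m} \alpha_j y_j \Phi(x_j) \right\rangle \geq 0 \tag{7.30}$$

for all $\alpha \in \mathbb{R}^m$.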


As described in Chapter 2, we can actually use a larger class of kernels without
destroying the convexity of the quadratic program. This is due to the fact that
the constraint (7.19) excludes certain parts of the space of multipliers $\alpha_i$. As a
result, we only need the kernel to be positive definite on the remaining points.
This is precisely guaranteed if we require $k$ to be conditionally positive definite
(see Definition 2.21). In this case, we have $\alpha^\top Q \alpha \geq 0$ for all coefficient vectors $\alpha$
satisfying (7.19).
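
The nonnegativity in (7.30) can be probed numerically. The sketch below evaluates the quadratic form $\alpha^\top Q \alpha$ with $Q_{ij} = y_i y_j k(x_i, x_j)$ for a Gaussian kernel on three hand-picked points and a small grid of coefficient vectors. The data and the grid are arbitrary choices for illustration; a finite check of this kind only illustrates the property and does not prove it.

```python
import math
from itertools import product

def gauss_kernel(x, z, c=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)

def quad_form(alphas, xs, ys, kernel):
    """Compute sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j), cf. (7.30)."""
    m = len(xs)
    return sum(
        alphas[i] * alphas[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
        for i in range(m)
        for j in range(m)
    )

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # arbitrary toy inputs
ys = [1, -1, 1]                            # arbitrary toy labels
grid = (-1.0, 0.0, 0.5, 2.0)               # coefficient values to try

values = [quad_form(a, xs, ys, gauss_kernel) for a in product(grid, repeat=3)]
print(min(values))  # smallest value of the quadratic form over the grid
```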

To compute the threshold $b$, we take into account that due to the KKT conditions
(7.16), $\alpha_j > 0$ implies (using (7.24))

$$\sum_{i=1}^{m} y_i \alpha_i k(x_j, x_i) + b = y_j. \tag{7.31}$$

Thus, the threshold can for instance be obtained by averaging

$$b = y_j - \sum_{i=1}^{m} y_i \alpha_i k(x_j, x_i), \tag{7.32}$$

over all points with $\alpha_j > 0$; in other words, all SVs. Alternatively, one can compute
$b$ from the value of the corresponding double dual variable; see Section 10.3 for
details. Sometimes it is also useful not to use the “optimal” $b$, but to change it in
order to adjust the number of false positives and false negatives.
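
The averaging in (7.32) can be sketched as follows. The example uses a one-dimensional, two-point problem with a linear kernel, for which the dual solution $\alpha = (0.5, 0.5)$ can be verified by hand by maximizing (7.29) under (7.19); the helper name `threshold` and the tolerance `tol` are assumptions of this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def threshold(xs, ys, alphas, kernel, tol=1e-12):
    """Average b = y_j - sum_i y_i alpha_i k(x_j, x_i) over all SVs, cf. (7.32)."""
    sv_indices = [j for j, a in enumerate(alphas) if a > tol]
    total = 0.0
    for j in sv_indices:
        expansion = sum(
            ys[i] * alphas[i] * kernel(xs[j], xs[i]) for i in range(len(xs))
        )
        total += ys[j] - expansion
    return total / len(sv_indices)

# Two 1-D training points: x = 0 with label -1, x = 2 with label +1.
xs = [(0.0,), (2.0,)]
ys = [-1, 1]
alphas = [0.5, 0.5]  # hand-derived dual solution for this toy problem
b = threshold(xs, ys, alphas, dot)
print(b)
```

With these values the hyperplane is $g(x) = x - 1$, so both training points sit exactly on the margin and each contributes the same estimate of $b$ to the average.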

Figure 1.7 shows how a simple binary toy problem is solved, using a Support 
Vector Machine with a radial basis function kernel (7.27). Note that the SVs are the 
patterns closest to the decision boundary — not only in the feature space, where 
by construction, the SVs are the patterns closest to the separating hyperplane, but 
also in the input space depicted in the figure. This feature differentiates SVMs 
from other types of classifiers. Figure 7.8 shows both the SVs and the centers ex- 
tracted by k-means, which are the expansion patterns that a classical RBF network 
approach would employ. 

In a study comparing the two approaches on the USPS problem of handwritten
character recognition, an SVM with a Gaussian kernel outperformed the classical
RBF network using Gaussian kernels [482]. A hybrid approach, where the SVM
algorithm was used to identify the centers (or hidden units) for the RBF network
(that is, as a replacement for k-means), exhibited a performance which was in
between the previous two. The study concluded that the SVM algorithm yielded
two advantages. First, it better identified good expansion patterns, and second, its
large margin regularizer led to second-layer weights that generalized better. We
should add, however, that using clever engineering, the classical RBF algorithm
can be improved to achieve a performance close to that of SVMs [427].

Figure 7.8 RBF centers automatically computed by the Support Vector algorithm
(indicated by extra circles), using a Gaussian kernel. The number of SV centers
accidentally coincides with the number of identifiable clusters (indicated by
crosses found by k-means clustering, with k = 2 and k = 3 for balls and circles,
respectively), but the naive correspondence between clusters and centers is lost;
indeed, 3 of the SV centers are circles, and only 2 of them are balls. Note that
the SV centers are chosen with respect to the classification task to be solved
(from [482]).


7.5 Soft Margin Hyperplanes 


So far, we have not said much about when the above will actually work. In 
practice, a separating hyperplane need not exist; and even if it does, it is not 
always the best solution to the classification problem. After all, an individual 
outlier in a data set, for instance a pattern which is mislabelled, can crucially affect 
the hyperplane. We would rather have an algorithm which can tolerate a certain 
fraction of outliers. 

A natural idea might be to ask for the algorithm to return the hyperplane 
that leads to the minimal number of training errors. Unfortunately, it turns out 
that this is a combinatorial problem. Worse still, the problem is even hard to 
approximate: Ben-David and Simon [34] have recently shown that it is NP-hard to 
find a hyperplane whose training error is worse by some constant factor than the 
optimal one. Interestingly, they also show that this can be alleviated by taking
into account the concept of the margin: if we disregard points that are within
some fixed positive margin of the hyperplane, then the problem has polynomial
complexity.

Cortes and Vapnik [111] chose a different approach for the SVM, following [40].
To allow for the possibility of examples violating (7.11), they introduced so-called
slack variables,

$$\xi_i \geq 0, \quad \text{where } i = 1, \ldots, m, \tag{7.33}$$

and used relaxed separation constraints (cf. (7.11)),

$$y_i(\langle \mathbf{x}_i, \mathbf{w} \rangle + b) \geq 1 - \xi_i, \quad i = 1, \ldots, m. \tag{7.34}$$

Clearly, by making $\xi_i$ large enough, the constraint on $(x_i, y_i)$ can always be met. In
order not to obtain the trivial solution where all $\xi_i$ take on large values, we thus
need to penalize them in the objective function. To this end, a term $\sum_i \xi_i$ is included
in (7.10).

In the simplest case, referred to as the C-SV classifier, this is done by solving, for
some $C > 0$,

$$\underset{\mathbf{w} \in \mathcal{H},\, \xi \in \mathbb{R}^m}{\text{minimize}} \quad \tau(\mathbf{w}, \xi) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i, \tag{7.35}$$

subject to the constraints (7.33) and (7.34). It is instructive to compare this to
Theorem 7.3, considering the case $\rho = 1$. Whenever the constraint (7.34) is met
with $\xi_i = 0$, the corresponding point will not be a margin error. All non-zero slacks
$\xi_i$ correspond to margin errors; hence, roughly speaking, the fraction of margin
errors in Theorem 7.3 increases with the second term in (7.35). The capacity term,
on the other hand, increases with $\|\mathbf{w}\|$. Hence, for a suitable positive constant $C$,
this approach approximately minimizes the right hand side of the bound.

Note, however, that if many of the $\xi_i$ attain large values (in other words, if the
classes to be separated strongly overlap, for instance due to noise), then $\sum_{i=1}^{m} \xi_i$ can
be significantly larger than the fraction of margin errors. In that case, there is no
guarantee that the hyperplane will generalize well.
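
To make the role of the slack variables concrete, the sketch below computes the smallest slacks satisfying (7.33) and (7.34) for a fixed linear hyperplane, and then evaluates the objective (7.35). The hyperplane, the four data points, and the value of $C$ are hypothetical and chosen so that the arithmetic can be followed by hand; nothing here is optimized.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, xs, ys):
    """Smallest xi_i >= 0 satisfying y_i(<x_i, w> + b) >= 1 - xi_i, cf. (7.33), (7.34)."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]

def csv_objective(w, b, xs, ys, C):
    """Value of 0.5 * ||w||^2 + (C / m) * sum_i xi_i, cf. (7.35)."""
    xi = slacks(w, b, xs, ys)
    return 0.5 * dot(w, w) + C / len(xs) * sum(xi)

# Hypothetical 1-D data and hyperplane g(x) = x - 1.
w, b = (1.0,), -1.0
xs = [(0.0,), (2.0,), (1.5,), (0.5,)]
ys = [-1, 1, 1, 1]

xi = slacks(w, b, xs, ys)
objective = csv_objective(w, b, xs, ys, C=10.0)
print(xi, objective)
```

The first two points lie on the margin and need no slack, the third is classified correctly but falls inside the margin, and the fourth is misclassified; the last one alone contributes a slack larger than 1, which illustrates why the slack sum can exceed the count of margin errors.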

As in the separable case (7.15), the solution can be shown to have an expansion

$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \tag{7.36}$$

where non-zero coefficients $\alpha_i$ can only occur if the corresponding example $(x_i, y_i)$
precisely meets the constraint (7.34). Again, the problem only depends on dot
products in $\mathcal{H}$, which can be computed by means of the kernel.

The coefficients $\alpha_i$ are found by solving the following quadratic programming
problem:

$$\underset{\alpha \in \mathbb{R}^m}{\text{maximize}} \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \tag{7.37}$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq \frac{C}{m} \quad \text{for all } i = 1, \ldots, m, \tag{7.38}$$

$$\text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{7.39}$$


To compute the threshold $b$, we take into account that due to (7.34), for Support
Vectors $x_j$ for which $\xi_j = 0$, we have (7.31). Thus, the threshold can be obtained by
averaging (7.32) over all Support Vectors $x_j$ (recall that they satisfy $\alpha_j > 0$) with
$\alpha_j < C/m$, the upper bound in (7.38).

In the above formulation, $C$ is a constant determining the trade-off between
two conflicting goals: minimizing the training error, and maximizing the margin.
Unfortunately, $C$ is a rather unintuitive parameter, and we have no a priori way
to select it.⁹ Therefore, a modification was proposed in [481], which replaces $C$ by
a parameter $\nu$; the latter will turn out to control the number of margin errors and
Support Vectors.

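As a primal problem for this approach, termed the $\nu$-SV classifier, we consider

$$\underset{\mathbf{w} \in \mathcal{H},\, \xi \in \mathbb{R}^m,\, \rho, b \in \mathbb{R}}{\text{minimize}} \quad \tau(\mathbf{w}, \xi, \rho) = \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i \tag{7.40}$$

$$\text{subject to} \quad y_i(\langle \mathbf{x}_i, \mathbf{w} \rangle + b) \geq \rho - \xi_i \tag{7.41}$$

$$\text{and} \quad \xi_i \geq 0, \quad \rho \geq 0. \tag{7.42}$$

Note that no constant $C$ appears in this formulation; instead, there is a parameter
$\nu$, and also an additional variable $\rho$ to be optimized. To understand the role of
$\rho$, note that for $\xi = 0$, the constraint (7.41) simply states that the two classes are
separated by the margin $2\rho/\|\mathbf{w}\|$ (cf. Problem 7.4).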

To explain the significance of $\nu$, let us first recall the term margin error: by this,
we denote points with $\xi_i > 0$. These are points which are either errors, or lie within
the margin. Formally, the fraction of margin errors is

$$R_{\mathrm{emp}}^{\rho}[g] := \frac{1}{m} \left| \{ i \mid y_i g(x_i) < \rho \} \right|. \tag{7.43}$$

Here, $g$ is used to denote the argument of the sgn in the decision function (7.25):
$f = \mathrm{sgn} \circ g$. We are now in a position to state a result that explains the significance
of $\nu$.

Proposition 7.5 ([481]) Suppose we run $\nu$-SVC with $k$ on some data with the result that
$\rho > 0$. Then

(i) $\nu$ is an upper bound on the fraction of margin errors.

(ii) $\nu$ is a lower bound on the fraction of SVs.

(iii) Suppose the data $(x_1, y_1), \ldots, (x_m, y_m)$ were generated iid from a distribution
$P(x, y) = P(x)P(y|x)$, such that neither $P(x, y = 1)$ nor $P(x, y = -1)$ contains any
discrete component. Suppose, moreover, that the kernel used is analytic and non-constant.
With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of
errors.


The proof can be found in Section A.2.
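
The two quantities bounded in parts (i) and (ii) are easy to compute once a solution is available. The following sketch evaluates the fraction of margin errors (7.43) and the fraction of SVs for a set of made-up outputs $g(x_i)$ and multipliers $\alpha_i$. These numbers are hypothetical, chosen only to be mutually consistent for $m = 5$ (multipliers at most $1/m$ and summing to $\nu = 0.6$); they are not the result of an actual $\nu$-SVC run.

```python
def margin_error_fraction(ys, gs, rho):
    """Fraction of points with y_i g(x_i) < rho, cf. (7.43)."""
    return sum(1 for y, g in zip(ys, gs) if y * g < rho) / len(ys)

def sv_fraction(alphas, tol=1e-12):
    """Fraction of points with alpha_i > 0."""
    return sum(1 for a in alphas if a > tol) / len(alphas)

# Hypothetical values for m = 5 training points.
ys = [1, 1, -1, -1, 1]
gs = [1.2, 0.3, -1.0, 0.1, 1.0]      # assumed outputs g(x_i)
alphas = [0.0, 0.2, 0.1, 0.2, 0.1]   # assumed dual coefficients, each <= 1/m
rho, nu = 1.0, 0.6

frac_errors = margin_error_fraction(ys, gs, rho)
frac_svs = sv_fraction(alphas)
print(frac_errors, nu, frac_svs)
```

For these numbers the fraction of margin errors is below $\nu$ and the fraction of SVs is above it, which is the ordering that the proposition describes.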
Before we get into the technical details of the dual derivation, let us take a look
at a toy example illustrating the influence of $\nu$ (Figure 7.9). The corresponding
fractions of SVs and margin errors are listed in Table 7.1.

9. As a default value, we use $C/m = 10$ unless stated otherwise.

Figure 7.9 Toy problem (task: separate circles from disks) solved using $\nu$-SV classification,
with parameter values ranging from $\nu = 0.1$ (top left) to $\nu = 0.8$ (bottom right). The larger
we make $\nu$, the more points are allowed to lie inside the margin (depicted by dotted lines).
Results are shown for a Gaussian kernel, $k(x, x') = \exp(-\|x - x'\|^2)$.

Table 7.1 Fractions of errors and SVs, along with the margins of class separation, for the
toy example in Figure 7.9. Note that $\nu$ upper bounds the fraction of errors and lower bounds
the fraction of SVs, and that increasing $\nu$, i.e., allowing more errors, increases the margin.

$\nu$                          0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
fraction of errors             0.00  0.07  0.25  0.32  0.39  0.50  …     …
fraction of SVs                …
margin $\rho/\|\mathbf{w}\|$   …

The derivation of the $\nu$-SVC dual is similar to the above SVC formulations, only
slightly more complicated. We consider the Lagrangian

$$L(\mathbf{w}, \xi, b, \rho, \alpha, \beta, \delta) = \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \Big( \alpha_i \big( y_i(\langle \mathbf{x}_i, \mathbf{w} \rangle + b) - \rho + \xi_i \big) + \beta_i \xi_i \Big) - \delta\rho, \tag{7.44}$$

using multipliers $\alpha_i, \beta_i, \delta \geq 0$. This function has to be minimized with respect to
the primal variables $\mathbf{w}, \xi, b, \rho$, and maximized with respect to the dual variables
$\alpha, \beta, \delta$. To eliminate the former, we compute the corresponding partial derivatives
and set them to 0, obtaining the following conditions:

$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \tag{7.45}$$

$$\alpha_i + \beta_i = 1/m, \tag{7.46}$$

$$\sum_{i=1}^{m} \alpha_i y_i = 0, \tag{7.47}$$

$$\sum_{i=1}^{m} \alpha_i - \delta = \nu. \tag{7.48}$$

Again, in the SV expansion (7.45), the $\alpha_i$ that are non-zero correspond to a
constraint (7.41) which is precisely met.

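Substituting (7.45) and (7.46) into $L$, using $\alpha_i, \beta_i, \delta \geq 0$, and incorporating
kernels for dot products, leaves us with the following quadratic optimization problem
for $\nu$-SV classification:

$$\underset{\alpha \in \mathbb{R}^m}{\text{maximize}} \quad W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \tag{7.49}$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{m}, \tag{7.50}$$

$$\sum_{i=1}^{m} \alpha_i y_i = 0, \tag{7.51}$$

$$\sum_{i=1}^{m} \alpha_i \geq \nu. \tag{7.52}$$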

As above, the resulting decision function can be shown to take the form

$$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i k(x, x_i) + b \right). \tag{7.53}$$

Compared with the C-SVC dual (7.37), there are two differences. First, there is an
additional constraint (7.52).¹⁰ Second, the linear term $\sum_{i=1}^{m} \alpha_i$ no longer appears in
the objective function (7.49). This has an interesting consequence: (7.49) is now
quadratically homogeneous in $\alpha$. It is straightforward to verify that the same
decision function is obtained if we start with the primal function

$$\tau(\mathbf{w}, \xi, \rho) = \frac{1}{2}\|\mathbf{w}\|^2 + C \left( -\nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i \right), \tag{7.54}$$

i.e., if one does use $C$, cf. Problem 7.16.

10. The additional constraint makes it more challenging to come up with efficient training
algorithms for large datasets. So far, two approaches have been proposed which work well.
One of them slightly modifies the primal problem in order to avoid the other equality
constraint (related to the offset $b$) [98]. The other one is a direct generalization of a
corresponding algorithm for C-SVC, which reduces the problem for each chunk to a linear system, and
which does not suffer any disadvantages from the additional constraint [407, 408]. See also
Sections 10.3.2, 10.4.3, and 10.6.3 for further details.

To compute the threshold $b$ and the margin parameter $\rho$, we consider two
sets $S_\pm$, of identical size $s > 0$, containing SVs $x_i$ with $0 < \alpha_i < 1/m$ and $y_i = \pm 1$,
respectively. Then, due to the KKT conditions, (7.41) becomes an equality with
$\xi_i = 0$. Hence, in terms of kernels,

$$b = -\frac{1}{2s} \sum_{x \in S_+ \cup S_-} \sum_{j=1}^{m} \alpha_j y_j k(x, x_j), \tag{7.55}$$

$$\rho = \frac{1}{2s} \left( \sum_{x \in S_+} \sum_{j=1}^{m} \alpha_j y_j k(x, x_j) - \sum_{x \in S_-} \sum_{j=1}^{m} \alpha_j y_j k(x, x_j) \right). \tag{7.56}$$

Note that for the decision function, only $b$ is actually required.
A connection to standard SV classification, and a somewhat surprising
interpretation of the regularization parameter $C$, is described by the following result:


Proposition 7.6 (Connection between $\nu$-SVC and C-SVC [481]) If $\nu$-SV classification leads to
$\rho > 0$, then C-SV classification, with $C$ set a priori to $1/\rho$, leads to the same decision
function.


Proof If we minimize (7.40), and then fix $\rho$ to minimize only over the remaining
variables, nothing will change. Hence the solution $\mathbf{w}_0, b_0, \xi_0$ minimizes (7.35), for
$C = 1$, subject to (7.41). To recover the constraint (7.34), we rescale to the set of
variables $\mathbf{w}' = \mathbf{w}/\rho$, $b' = b/\rho$, $\xi' = \xi/\rho$. This leaves us with the objective function
(7.35), up to a constant scaling factor $\rho^2$, using $C = 1/\rho$. ∎
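
The rescaling step in the proof can be written out explicitly. The following is a short worked version under the proof's own assumptions ($\rho > 0$ fixed at its optimal value, so that the term $-\nu\rho$ is a constant and can be dropped); the primed variables are those defined in the proof.

```latex
% Objective for fixed rho, rewritten in w = rho w', xi = rho xi':
\begin{align*}
\tfrac{1}{2}\|\mathbf{w}\|^2 + \tfrac{1}{m}\sum_{i=1}^{m} \xi_i
  &= \tfrac{1}{2}\rho^2\|\mathbf{w}'\|^2 + \tfrac{\rho}{m}\sum_{i=1}^{m} \xi_i' \\
  &= \rho^2 \Big( \tfrac{1}{2}\|\mathbf{w}'\|^2
     + \tfrac{1/\rho}{m}\sum_{i=1}^{m} \xi_i' \Big).
\end{align*}
% Constraint (7.41), divided by rho > 0, becomes (7.34):
\begin{align*}
y_i(\langle \mathbf{x}_i, \mathbf{w} \rangle + b) \geq \rho - \xi_i
\quad\Longleftrightarrow\quad
y_i(\langle \mathbf{x}_i, \mathbf{w}' \rangle + b') \geq 1 - \xi_i'.
\end{align*}
```

The bracketed expression is (7.35) with $C = 1/\rho$, and since $\mathbf{w}'$ and $b'$ differ from $\mathbf{w}$ and $b$ only by the positive factor $1/\rho$, the sign of the decision function is unchanged.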


For further details on the connection between $\nu$-SVMs and C-SVMs, see [122, 38].
A complete account has been given by Chang and Lin [98], who show that for a
given problem and kernel, there is an interval $[\nu_{\min}, \nu_{\max}]$ of admissible values
for $\nu$, with $0 \leq \nu_{\min} \leq \nu_{\max} \leq 1$. The boundaries of the interval are computed
by considering $\sum_i \alpha_i$ as returned by the C-SVM in the limits $C \to \infty$ and $C \to 0$,
respectively.

It has been noted that $\nu$-SVMs have an interesting interpretation in terms of
reduced convex hulls [122, 38] (cf. (7.21)). If a problem is non-separable, the convex
hulls will no longer be disjoint. Therefore, it no longer makes sense to search for the
shortest line connecting them, and the approach of (7.22) will fail. In this situation,
it seems natural to reduce the convex hulls in size, by limiting the size of the
coefficients $c_i$ in (7.21) to some value $\nu \in (0, 1)$. Intuitively, this amounts to limiting
the influence of individual points; note that in the original problem (7.22), two
single points can already determine the solution. It is possible to show that the
$\nu$-SVM formulation solves the problem of finding the hyperplane orthogonal to the
closest line connecting the reduced convex hulls [122].

We now move on to another aspect of soft margin classification. When we
introduced the slack variables, we did not attempt to justify the fact that in the
objective function, we used a penalizer $\sum_{i=1}^{m} \xi_i$. Why not use another penalizer,
such as $\sum_{i=1}^{m} \xi_i^p$, for some $p \geq 0$ [111]? For instance, $p = 0$ would yield a penalizer
that exactly counts the number of margin errors. Unfortunately, however, it is also a
penalizer that leads to a combinatorial optimization problem. Penalizers yielding
optimization problems that are particularly convenient, on the other hand, are
obtained for $p = 1$ and $p = 2$. By default, we use the former, as it possesses an
additional property which is statistically attractive. As the following proposition
shows, linearity of the target function in the slack variables $\xi_i$ leads to a certain
“outlier” resistance of the estimator. As above, we use the shorthand $\mathbf{x}_i$ for $\Phi(x_i)$.


Proposition 7.7 (Resistance of SV classification [481]) Suppose $\mathbf{w}$ can be expressed
in terms of the SVs which are not at bound,

$$\mathbf{w} = \sum_{i=1}^{m} \gamma_i \mathbf{x}_i \tag{7.57}$$

with $\gamma_i \neq 0$ only if $\alpha_i \in (0, 1/m)$ (where the $\alpha_i$ are the coefficients of the dual solution).
Then local movements of any margin error $\mathbf{x}$ parallel to $\mathbf{w}$ do not change the hyperplane.¹¹


The proof can be found in Section A.2. For further results in support of the $p = 1$
case, see [527].

Note that the assumption (7.57) is not as restrictive as it may seem. Even though
the SV expansion of the solution, $\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i$, often contains many multipliers
$\alpha_i$ which are at bound, it is nevertheless quite conceivable, especially when
discarding the requirement that the coefficients be bounded, that we can obtain an
expansion (7.57) in terms of a subset of the original vectors.

For instance, if we have a 2-D problem that we solve directly in input space, i.e.,
with $k(x, x') = \langle x, x' \rangle$, then it suffices to have two linearly independent SVs which
are not at bound, in order to express $\mathbf{w}$. This holds true regardless of whether or
not the two classes overlap, even if there are many SVs which are at the upper
bound. Further information on resistance and robustness of SVMs can be found in
Sections 3.4 and 9.3.

We have introduced SVs as those training examples $x_i$ for which $\alpha_i > 0$. In
some cases, it is useful to further distinguish different types of SVs. For reference
purposes, we give a list of different types of SVs (Table 7.2).

In Section 7.3, we used the KKT conditions to argue that in the hard margin case,
the SVs lie exactly on the margin. Using an identical argument for the soft margin
case, we see that in this instance, in-bound SVs lie on the margin (Problem 7.9).

Note that in the hard margin case, where $\alpha_{\max} = \infty$, every SV is an in-bound
SV. Note, moreover, that for kernels that produce full-rank Gram matrices, such as
the Gaussian (Theorem 2.18), in theory every SV is essential (provided there are
no duplicate patterns in the training set).¹²


11. Note that the perturbation of the point is carried out in feature space. What it precisely
corresponds to in input space therefore depends on the specific kernel chosen.
12. In practice, Gaussian Gram matrices usually have some eigenvalues that are close to 0.

Table 7.2 Overview of different types of SVs. In each case, the condition on the Lagrange
multipliers $\alpha_i$ (corresponding to an SV $x_i$) is given. In the table, $\alpha_{\max}$ stands for the upper
bound in the optimization problem; for instance, $\alpha_{\max} = C/m$ in (7.38) and $\alpha_{\max} = 1/m$ in (7.50).

Type of SV     | Definition                                      | Properties
(standard) SV  | $0 < \alpha_i$                                  | lies on or in margin
in-bound SV    | $0 < \alpha_i < \alpha_{\max}$                  | lies on margin
bound SV       | $\alpha_i = \alpha_{\max}$                      | usually lies in margin
essential SV   | appears in all possible expansions of solution  | becomes margin error when left out (Section 7.3)

7.6 Multi-Class Classification 


So far, we have talked about binary classification, where the class labels can
only take two values: $\pm 1$. Many real-world problems, however, have more than
two classes, an example being the widely studied optical character recognition
(OCR) problem. We will now review some methods for dealing with this issue.


7.6.1 One Versus the Rest 


To get $M$-class classifiers, it is common to construct a set of binary classifiers
$f^1, \ldots, f^M$, each trained to separate one class from the rest, and combine them
by doing the multi-class classification according to the maximal output before
applying the sgn function; that is, by taking

$$\underset{j = 1, \ldots, M}{\mathrm{argmax}}\; g^j(x), \quad \text{where} \quad g^j(x) = \sum_{i=1}^{m} y_i \alpha_i^j k(x, x_i) + b^j \tag{7.58}$$

(note that $f^j(x) = \mathrm{sgn}(g^j(x))$, cf. (7.25)).

The values $g^j(x)$ can also be used for reject decisions. To see this, we consider
the difference between the two largest $g^j(x)$ as a measure of confidence in the
classification of $x$. If that measure falls short of a threshold $\theta$, the classifier rejects
the pattern and does not assign it to a class (it might instead be passed on to
a human expert). This has the consequence that on the remaining patterns, a
lower error rate can be achieved. Some benchmark comparisons report a quantity
referred to as the punt error, which denotes the fraction of test patterns that must
be rejected in order to achieve a certain accuracy (say 1% error) on the remaining
test samples. To compute it, the value of $\theta$ is adjusted on the test set [64].
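
The winner-takes-all rule (7.58) with a reject option can be sketched in a few lines. The function below takes the real-valued outputs $g^1(x), \ldots, g^M(x)$ as a list and a threshold `theta`; the name `classify_with_reject`, the use of `None` for a rejected pattern, and the 0-based class indices are conventions of this sketch, and the example outputs are made up.

```python
def classify_with_reject(g_values, theta):
    """Return the index of the largest output, or None if the gap between
    the two largest outputs falls short of theta (requires M >= 2)."""
    order = sorted(range(len(g_values)), key=lambda j: g_values[j], reverse=True)
    best, second = order[0], order[1]
    if g_values[best] - g_values[second] < theta:
        return None  # reject: confidence too low
    return best

# Hypothetical outputs of three one-versus-the-rest classifiers.
confident = classify_with_reject([0.9, -1.2, -0.8], theta=0.5)
ambiguous = classify_with_reject([0.2, 0.1, -1.0], theta=0.5)
print(confident, ambiguous)
```

Raising `theta` rejects more patterns and typically lowers the error rate on those that remain, which is the trade-off that the punt error summarizes.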

The main shortcoming of (7.58), sometimes called the winner-takes-all approach,
is that it is somewhat heuristic. The binary classifiers used are obtained by training
on different binary classification problems, and thus it is unclear whether their
real-valued outputs (before thresholding) are on comparable scales.¹³ This can be
a problem, since situations often arise where several binary classifiers assign the
pattern to their respective class (or where none does); in this case, one class must
be chosen by comparing the real-valued outputs.

In addition, binary one-versus-the-rest classifiers have been criticized for deal- 
ing with rather asymmetric problems. For instance, in digit recognition, the clas- 
sifier trained to recognize class ‘7’ is usually trained on many more negative than 
positive examples. We can deal with these asymmetries by using values of the reg- 
ularization constant C which differ for the respective classes (see Problem 7.10). It 
has nonetheless been argued that the following approach, which is more symmet- 
ric from the outset, can be advantageous. 


7.6.2 Pairwise Classification 


In pairwise classification, we train a classifier for each possible pair of classes
[178, 463, 233, 311]. For $M$ classes, this results in $(M - 1)M/2$ binary classifiers.
This number is usually larger than the number of one-versus-the-rest classifiers;
for instance, if $M = 10$, we need to train 45 binary classifiers rather than 10 as
in the method above. Although this suggests large training times, the individual
problems that we need to train on are significantly smaller, and if the training
algorithm scales superlinearly with the training set size, it is actually possible to
save time.

Similar considerations apply to the runtime execution speed. When we try to
classify a test pattern, we evaluate all 45 binary classifiers, and classify according
to which of the classes gets the highest number of votes. A vote for a given
class is defined as a classifier putting the pattern into that class.¹⁴ The individual
classifiers, however, are usually smaller in size (they have fewer SVs) than they
would be in the one-versus-the-rest approach. This is for two reasons: First, the
training sets are smaller, and second, the problems to be learned are usually easier,
since the classes have less overlap.

Nevertheless, if $M$ is large, and we evaluate the $(M - 1)M/2$ classifiers, then the
resulting system may be slower than the corresponding one-versus-the-rest SVM.
To illustrate this weakness, consider the following hypothetical situation: Suppose,
in a digit recognition task, that after evaluating the first few binary classifiers,
both digit 7 and digit 8 seem extremely unlikely (they already “lost” on several
classifiers). In that case, it would seem pointless to evaluate the 7-vs-8 classifier.
This idea can be cast into a precise framework by embedding the binary classifiers
into a directed acyclic graph. Each classification run then corresponds to a directed
traversal of that graph, and classification can be much faster [411].
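
The plain voting scheme (without the directed acyclic graph speed-up) can be sketched as follows. The binary classifiers here are toy stand-ins that compare distances to hypothetical class centers on the real line, not trained SVMs; the tie-breaking rule (first class listed wins) is likewise an assumption of this sketch.

```python
from itertools import combinations

def pairwise_vote(x, classes, binary_classifiers):
    """Evaluate all (M - 1)M/2 pairwise classifiers and return the class
    with the most votes; ties go to the class listed first."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        # A positive output is a vote for class a, otherwise for class b.
        winner = a if binary_classifiers[(a, b)](x) > 0 else b
        votes[winner] += 1
    return max(classes, key=lambda c: votes[c])

# Toy stand-in classifiers: vote for the class whose center is closer to x.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
classes = [0, 1, 2]
binary_classifiers = {
    (a, b): (lambda x, a=a, b=b: 1 if abs(x - centers[a]) < abs(x - centers[b]) else -1)
    for a, b in combinations(classes, 2)
}

predicted = pairwise_vote(6.0, classes, binary_classifiers)
print(len(binary_classifiers), predicted)
```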


13. Note, however, that some effort has gone into developing methods for transforming the
real-valued outputs into class probabilities [521, 486, 410].
14. Some care has to be exercised in tie-breaking. For further detail, see [311].

7.6.3 Error-Correcting Output Coding


The method of error-correcting output codes was developed in [142], and later
adapted to the case of SVMs [5]. In a nutshell, the idea is as follows. Just as we
can generate a binary problem from a multiclass problem by separating one class
from the rest (digit 0 from digits 1 through 9, say), we can generate a large
number of further binary problems by splitting the original set of classes into two
subsets. For instance, we could separate the even digits from the odd ones, or we
could separate digits 0 through 4 from 5 through 9. It is clear that if we design
a set of binary classifiers $f^1, \ldots, f^L$ in the right way, then the binary responses
will completely determine the class of a test pattern. Each class corresponds to a
unique vector in $\{\pm 1\}^L$; for $M$ classes, we thus get a so-called decoding matrix
$\mathbf{M} \in \{\pm 1\}^{M \times L}$. What happens if the binary responses are inconsistent with each other;
if, for instance, the problem is noisy, or the training sample is too small to estimate
the binary classifiers reliably? Formally, this means that we will obtain a vector
of responses $f^1(x), \ldots, f^L(x)$ which does not occur as a row of the matrix $\mathbf{M}$. To deal with
these cases, [142] proposed designing a clever set of binary problems, which yields
robustness against some errors. Here, the closest match between the vector of
responses and the rows of the matrix is determined using the Hamming distance
(the number of entries where the two vectors differ; for vectors in $\{\pm 1\}^L$, this is
half the $L_1$ distance). Now imagine a situation where the code is such that the minimal Hamming
distance between any two rows is three. In this case, we can guarantee that we will correctly classify all
test examples which lead to at most one error amongst the binary classifiers.

This method produces very good results in multi-class tasks; nevertheless, it 
has been pointed out that it does not make use of a crucial quantity in classifiers: 
the margin. Recently [5], a version was developed that replaces the Hamming- 
based decoding with a more sophisticated scheme that takes margins into account. 
Recommendations are also made regarding how to design good codes for margin 
classifiers, such as SVMs. 
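
Hamming-based decoding can be illustrated with a small hand-built code. The matrix below (4 classes, 6 binary problems) is a hypothetical example constructed for this sketch so that any two rows differ in at least three positions; it is not a code taken from [142] or [5].

```python
from itertools import combinations

def hamming(u, v):
    """Number of entries in which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def ecoc_decode(responses, code_matrix):
    """Return the index of the row closest to the responses in Hamming distance."""
    return min(
        range(len(code_matrix)),
        key=lambda r: hamming(responses, code_matrix[r]),
    )

# Hypothetical decoding matrix: one row per class, one column per binary classifier.
code_matrix = [
    (+1, +1, +1, +1, +1, +1),
    (+1, +1, +1, -1, -1, -1),
    (-1, -1, +1, +1, +1, -1),
    (-1, +1, -1, +1, -1, +1),
]
min_distance = min(hamming(u, v) for u, v in combinations(code_matrix, 2))

# Responses equal to the row of class 2, except that the fifth classifier erred.
responses = (-1, -1, +1, +1, -1, -1)
decoded = ecoc_decode(responses, code_matrix)
print(min_distance, decoded)
```

Because the minimal row distance is three, a single flipped response leaves the correct row at distance one and every other row at distance at least two, so the decoder recovers the class.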


7.6.4 Multi-Class Objective Functions 


Arguably the most elegant multi-class algorithm, and certainly the method most
closely aligned with Vapnik’s principle of always trying to solve problems directly,
entails modifying the SVM objective function in such a way that it simultaneously
allows the computation of a multi-class classifier. For instance [593, 58], we can
modify (7.35) and use the following quadratic program:

$$\underset{\mathbf{w}_r \in \mathcal{H},\, \xi^r \in \mathbb{R}^m,\, b_r \in \mathbb{R}}{\text{minimize}} \quad \frac{1}{2} \sum_{r=1}^{M} \|\mathbf{w}_r\|^2 + \frac{C}{m} \sum_{i=1}^{m} \sum_{r \neq y_i} \xi_i^r \tag{7.59}$$

$$\text{subject to} \quad \langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle + b_{y_i} \geq \langle \mathbf{w}_r, \mathbf{x}_i \rangle + b_r + 2 - \xi_i^r, \quad \xi_i^r \geq 0, \tag{7.60}$$

where $r \in \{1, \ldots, M\} \setminus y_i$, and $y_i \in \{1, \ldots, M\}$ is the multi-class label of the
pattern $\mathbf{x}_i$ (cf. Problem 7.17).

In terms of accuracy, the results obtained with this approach are comparable to 
those obtained with the widely used one-versus-the-rest approach. Unfortunately, 
the optimization problem is such that it has to deal with all SVs at the same 
time. In the other approaches, the individual binary classifiers usually have much 
smaller SV sets, with beneficial effects on the training time. For further multiclass 
approaches, see [160, 323]. Generalizations to multi-label problems, where patterns 
are allowed to belong to several classes at the same time, are discussed in [162]. 

Overall, it is fair to say that there is probably no multi-class approach that gen- 
erally outperforms the others. For practical problems, the choice of approach will 
depend on constraints at hand. Relevant factors include the required accuracy, the 
time available for development and training, and the nature of the classification 
problem (e.g., for a problem with very many classes, it would not be wise to use 
(7.59)). That said, a simple one-against-the-rest approach often produces accept- 
able results. 


7.7 Variations on a Theme 



There are a number of variations of the standard SV classification algorithm, such 
as the elegant leave-one-out machine [589, 592] (see also Section 12.2.2 below), the 
idea of Bayes point machines [451, 239, 453, 545, 392], and extensions to feature 
selection [70, 224, 590]. Due to lack of space, we only describe one of the variations; 
namely, linear programming machines. 

As we have seen above, the SVM approach automatically leads to a decision
function of the form (7.25). Let us rewrite it as $f(x) = \mathrm{sgn}(g(x))$, with

$$g(x) = \sum_{i=1}^{m} v_i k(x, x_i) + b. \tag{7.61}$$

In Chapter 4, we showed that this form of the solution is essentially a consequence
of the form of the regularizer $\|\mathbf{w}\|^2$ (Theorem 4.2). The idea of linear programming
(LP) machines is to use the kernel expansion as an ansatz for the solution, but to
use a different regularizer, namely the $\ell_1$ norm of the coefficient vector [343, 344,
74, 184, 352, 37, 591, 593, 39]. The main motivation for this is that this regularizer
is known to induce sparse expansions (see Chapter 4).

This amounts to the objective function

$$R_{\mathrm{reg}}[g] := \frac{1}{m}\|v\|_1 + C\, R_{\mathrm{emp}}[g], \tag{7.62}$$

where $\|v\|_1 = \sum_{i=1}^{m} |v_i|$ denotes the $\ell_1$ norm in coefficient space, using the soft
margin empirical risk,

$$R_{\mathrm{emp}}[g] = \frac{1}{m} \sum_{i=1}^{m} \xi_i, \tag{7.63}$$

with slack terms

$$\xi_i = \max\{1 - y_i g(x_i), 0\}. \tag{7.64}$$

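We thus obtain a linear programming problem:

\[
\begin{aligned}
\underset{\alpha, \alpha^*, \xi \in \mathbb{R}^m,\; b \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \frac{C}{m} \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i g(x_i) \ge 1 - \xi_i, \\
& \alpha_i, \alpha_i^*, \xi_i \ge 0.
\end{aligned}
\tag{7.65}
\]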


Here, we have dealt with the $\ell_1$-norm by splitting each component $v_i$ into its
positive and negative part: $v_i = \alpha_i - \alpha_i^*$ in (7.61). The solution differs from (7.25)
in that it is no longer necessarily the case that each expansion pattern has a weight
$\alpha_i y_i$ whose sign equals its class label. This property would have to be enforced
separately (Problem 7.19). Moreover, it is also no longer the case that the expansion
patterns lie on or beyond the margin; in LP machines, they can basically be
anywhere.
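
To make the structure of the linear program concrete, the following minimal sketch assembles (7.65) in the form "minimize c·z subject to A z ≤ b" that many LP solvers accept (for instance, `scipy.optimize.linprog` takes arguments of this shape). The function names, the variable ordering, and the plain-list representation are assumptions made for illustration; the slack term is weighted by C/m, following (7.62) and (7.63). No solver is called here.

```python
def lp_machine_problem(K, y, C):
    """Assemble the LP machine (7.65) as: minimize c.z s.t. A_ub z <= b_ub.

    K is the m x m kernel matrix K[i][j] = k(x_i, x_j), y the labels in {+1, -1}.
    Variable order (an assumption of this sketch):
        z = (alpha_1..alpha_m, alpha*_1..alpha*_m, xi_1..xi_m, b).
    """
    m = len(y)
    # Objective: (1/m) sum(alpha_i + alpha*_i) + (C/m) sum(xi_i); b costs nothing.
    c = [1.0 / m] * (2 * m) + [C / m] * m + [0.0]
    A_ub, b_ub = [], []
    for i in range(m):
        # y_i g(x_i) >= 1 - xi_i  is rewritten as  -y_i g(x_i) - xi_i <= -1,
        # where g(x_i) = sum_j (alpha_j - alpha*_j) K[i][j] + b.
        row = [-y[i] * K[i][j] for j in range(m)]            # alpha block
        row += [y[i] * K[i][j] for j in range(m)]            # alpha* block
        row += [-1.0 if j == i else 0.0 for j in range(m)]   # xi block
        row.append(-float(y[i]))                             # threshold b
        A_ub.append(row)
        b_ub.append(-1.0)
    # alpha, alpha*, xi are nonnegative; b is unconstrained.
    bounds = [(0.0, None)] * (3 * m) + [(None, None)]
    return c, A_ub, b_ub, bounds


def objective(c, z):
    """Value of the linear objective c.z."""
    return sum(ci * zi for ci, zi in zip(c, z))


def is_feasible(z, A_ub, b_ub, bounds, tol=1e-9):
    """Check the inequality constraints and the variable bounds for a point z."""
    rows_ok = all(
        sum(a * zi for a, zi in zip(row, z)) <= rhs + tol
        for row, rhs in zip(A_ub, b_ub)
    )
    bounds_ok = all(
        (lo is None or zi >= lo - tol) and (hi is None or zi <= hi + tol)
        for zi, (lo, hi) in zip(z, bounds)
    )
    return rows_ok and bounds_ok
```

Since the objective is linear in all variables, the sparsity of the solution comes entirely from the $\ell_1$ term in the cost vector, not from any property of the constraints.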

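ν-LPMs

LP machines can also benefit from the ν-trick. In this case, the programming
problem can be shown to take the following form [212]:

\[
\begin{aligned}
\underset{\alpha, \alpha^*, \xi \in \mathbb{R}^m,\; b, \rho \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} \xi_i - \nu\rho \\
\text{subject to} \quad & \frac{1}{m} \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) = 1, \\
& y_i g(x_i) \ge \rho - \xi_i, \\
& \alpha_i, \alpha_i^*, \xi_i, \rho \ge 0.
\end{aligned}
\tag{7.66}
\]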


We will not go into further detail at this point. Additional information on 
linear programming machines from a regularization point of view is given in 
Section 4.9.2. 


7.8 Experiments 
7.8.1 Digit Recognition Using Different Kernels 


Handwritten digit recognition has long served as a test bed for evaluating and 
benchmarking classifiers [318, 64, 319]. Thus, it was imperative in the early days of 
SVM research to evaluate the SV method on widely used digit recognition tasks. In 
this section we report results on the US Postal Service (USPS) database (described 
in Section A.1). We shall return to the character recognition problem in Chapter 11, 
where we consider the larger MNIST database. 

As described above, the difference between C-SVC and ν-SVC lies only in the
fact that we have to select a different parameter a priori. If we are able to do this
well, we obtain identical performance. The experiments reported were carried out
before the development of ν-SVC, and thus all use C-SVC code.

Table 7.3 Performance on the USPS set, for three different types of classifier, constructed
with the Support Vector algorithm by choosing different functions k in (7.25) and (7.29).
Error rates on the test set are given; and for each of the ten-class-classifiers, we also show
the average number of Support Vectors of the ten two-class-classifiers. The normalization
factor of 256 is tailored to the dimensionality of the data, which is 16 × 16.

polynomial: $k(x, x') = (\langle x, x' \rangle / 256)^d$

RBF: $k(x, x') = \exp(-\|x - x'\|^2 / (256\, c))$
av. # of SVs | 266 | 240 | 233 | 235 | 25

sigmoid: $k(x, x') = \tanh(2 \langle x, x' \rangle / 256 + \Theta)$
$-\Theta$ | 0.8 | 0.9 | 1.0 | 1.1 | 1.2 | 1.3 | 1.4

In the present study, we put particular emphasis on comparing different types 
of SV classifiers obtained by choosing different kernels. We report results for poly- 
nomial kernels (7.26), Gaussian radial basis function kernels (7.27), and sigmoid 
kernels (7.28), summarized in Table 7.3. In all three cases, error rates around 4% 
can be achieved. 

Note that in practical applications, it is usually helpful to scale the argument 
of the kernel, such that the numerical values do not get extremely small or large 
as the dimension of the data increases. This helps avoid large roundoff errors, 
and prevents over- and underflow. In the present case, the scaling was done by 
including the factor 256 in Table 7.3. 
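
As a small illustration of this scaling, the three kernels of Table 7.3 can be sketched as follows. The function names and the `scale` argument are choices made for this sketch; the default of 256 corresponds to the 16 × 16 USPS images.

```python
import math


def dot(x, xp):
    """Canonical dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, xp))


def poly_kernel(x, xp, d, scale=256.0):
    """Polynomial kernel of degree d with a rescaled dot product."""
    return (dot(x, xp) / scale) ** d


def rbf_kernel(x, xp, c, scale=256.0):
    """Gaussian RBF kernel; the squared distance is divided by scale * c."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / (scale * c))


def sigmoid_kernel(x, xp, theta, scale=256.0):
    """Sigmoid kernel tanh(2 <x, x'> / scale + theta)."""
    return math.tanh(2.0 * dot(x, xp) / scale + theta)
```

For an input with all 256 entries equal to 1, the unscaled degree-7 polynomial kernel would evaluate to $256^7$, whereas the scaled version evaluates to 1. This is the kind of numerical range problem that the normalization is meant to avoid.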

The results show that the Support Vector algorithm allows the construction of 
a range of learning machines, all of which perform well. The similar performance 
for the three different functions k suggests that among these cases, the choice of 
the set of decision functions is less important than capacity control in the chosen 
type of structure. This phenomenon is well-known for the Parzen window density
estimator in $\mathbb{R}^N$ (e.g., [226]),

\[ p(x) = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{w^N} k\left(\frac{x - x_j}{w}\right). \tag{7.67} \]

It is of great importance in this case to choose an appropriate value of the bandwidth
parameter w for a given amount of data. Similar parallels can be drawn to
the solution of ill-posed problems; for a discussion, see [561].

Figure 7.10 2D toy example of a binary classification problem solved using a soft margin
SVC. In all cases, a Gaussian kernel (7.27) is used. From left to right, we decrease the kernel
width. Note that for a large width, the decision boundary is almost linear, and the data
set cannot be separated without error (see text). Solid lines represent decision boundaries;
dotted lines depict the edge of the margin (where (7.34) becomes an equality with $\xi_i = 0$).
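
The role of the bandwidth in the Parzen window estimator (7.67) can be explored with a minimal sketch such as the one below. The Gaussian window and the function names are assumptions made for illustration; any window function that integrates to one could be substituted.

```python
import math


def gaussian_window(u):
    """Standard normal density on R^N, used here as the window function k."""
    n = len(u)
    return math.exp(-0.5 * sum(t * t for t in u)) / (2.0 * math.pi) ** (n / 2.0)


def parzen_density(x, data, w):
    """Parzen window estimate p(x) = (1/m) sum_j w^(-N) k((x - x_j) / w)."""
    m, n = len(data), len(x)
    total = 0.0
    for xj in data:
        # Rescale the difference vector by the bandwidth w.
        u = [(a - b) / w for a, b in zip(x, xj)]
        total += gaussian_window(u)
    return total / (m * w ** n)
```

A small w concentrates the estimate in narrow peaks around the sample points, while a large w smooths it out; in both regimes the estimate still integrates to one.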

Figure 7.10 shows a toy example using a Gaussian kernel (7.27), illustrating that 
it is crucial to pick the right kernel parameter. In all cases, the same value of C 
was used, but the kernel width c was varied. For large values of c, the classifier is 
almost linear, and it cannot separate the data set without errors. For a small width 
(right), the data set is practically memorized. For an intermediate width (middle), 
a trade-off is made between allowing some training errors and using a “simple” 
decision boundary. 

In practice, both the kernel parameters and the value of C (or ν) are often chosen
using cross validation. To this end, we first split the data set into p parts of equal 
size, say, p = 10. We then perform ten training runs. Each time, we leave out one 
of the ten parts and use it as an independent validation set for optimizing the 
parameters. In the simplest case, we choose the parameters which work best, on 
average over the ten runs. It is common practice, however, to then train on the full 
training set, using these average parameters. There are some problems with this. 
First, it amounts to optimizing the parameters on the same set as the one used for 
training, which can lead to overfitting. Second, the optimal parameter settings for 
data sets of size $m$ and $\frac{9}{10} m$, respectively, do not usually coincide. Typically, the
smaller set will require a slightly stronger regularization. This could mean a wider
Gaussian kernel, a smaller polynomial degree, a smaller C, or a larger ν. Even
worse, it is theoretically possible that there is a so-called phase transition (e.g.,
[393]) in the learning curve between the two sample sizes. This means that the
generalization error as a function of the sample size could change dramatically
between $\frac{9}{10} m$ and $m$. Having said all this, practitioners often do not care about
these theoretical precautions, and use the unchanged parameters with excellent 
results. For further detail, see Section 12.2. 
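
The cross validation procedure just described can be sketched as follows. The callables `train` and `error` are placeholders to be supplied by the user (they are not part of any particular SVM library), and the interleaved split into p parts is just one simple way of obtaining parts of almost equal size.

```python
def cross_validate(data, param_grid, train, error, p=10):
    """Pick the parameter setting with the lowest mean validation error.

    `train(examples, param)` returns a model and `error(model, examples)`
    returns an error rate; both are placeholders supplied by the caller.
    """
    # Split the data into p parts of (almost) equal size.
    folds = [data[i::p] for i in range(p)]
    best_param, best_err = None, float("inf")
    for param in param_grid:
        fold_errors = []
        for i in range(p):
            # Leave out part i and use it as the validation set.
            validation = folds[i]
            training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train(training, param)
            fold_errors.append(error(model, validation))
        mean_err = sum(fold_errors) / p
        if mean_err < best_err:
            best_param, best_err = param, mean_err
    return best_param, best_err
```

The returned setting would then typically be used for a final training run on the full training set, subject to the caveats mentioned above.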

In some cases, one can try to avoid the whole procedure by using an educated 
guess. Below, we list several methods. 
■ Use parameter settings that have worked well for similar problems. Here, some
care has to be exercised in the scaling of kernel parameters. For instance, when
using an RBF kernel, c must be rescaled to ensure that $\|x_i - x_j\|^2 / c$ roughly lies in
the same range, even if the scaling and dimension of the data are different.


■ For many problems, there is some prior expectation regarding the typical error
rate. Let us assume we are looking at an image classification task, and we have
already tried three other approaches, all of which yielded around 5% test error.
Using ν-SV classifiers, we can incorporate this knowledge by choosing a value
for ν which is in that range, say ν = 5%. The reason for this guess is that we know
(Proposition 7.5) that the margin error is then below 5%, which in turn implies that
the training error is below 5%. The training error will typically be smaller than the
test error, thus it is consistent that it should be upper bounded by the 5% test error.


■ In a slightly less elegant way, one can try to mimic this procedure for C-SV
classifiers. To this end, we start off with a large value of C, and reduce it until
the number of Lagrange multipliers that are at the upper bound (in other words,
the number of margin errors) is in a suitable range (say, somewhat below 5%).
Compared to the above procedure for choosing ν, the disadvantage is that this
entails a number of training runs. We can also monitor the number of actual 
training errors during the training runs, but since not every margin error is a 
training error, this is often less sensitive. Indeed, the difference between training 
error and test error can often be quite substantial. For instance, on the USPS set, 
most of the results reported here were obtained with systems that had essentially 
zero training error. 


■ One can put forward scaling arguments which indicate that $C \propto 1/R^2$, where R
is a measure for the range of the data in feature space that scales like the length
of the points in $\mathcal{H}$. Examples thereof are the standard deviation of the distance of
the points to their mean, the radius of the smallest sphere containing the data (cf.
(5.61) and (8.17)), or, in some cases, the maximum (or mean) length $k(x_i, x_i)$ over
all data points (see Problem 7.25).


■ Finally, we can use theoretical tools such as VC bounds (see, for instance,
Figure 5.5) or leave-one-out bounds (Section 12.2).
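
As an illustration of the scaling heuristic $C \propto 1/R^2$ from the list above, the following sketch computes one possible range measure directly from the kernel matrix. Taking $R^2$ to be the mean squared distance of the mapped points to their mean in feature space, and setting the constant of proportionality to 1, are assumptions made for this sketch; other measures of R mentioned above could be used instead.

```python
def feature_space_variance(K):
    """Mean squared distance of the mapped points to their feature space mean.

    For a kernel matrix K[i][j] = k(x_i, x_j), this equals
        (1/m) sum_i K[i][i] - (1/m^2) sum_{i,j} K[i][j].
    """
    m = len(K)
    trace_term = sum(K[i][i] for i in range(m)) / m
    mean_term = sum(sum(row) for row in K) / (m * m)
    return trace_term - mean_term


def guess_C(K):
    """Initial guess C = 1 / R^2 (the proportionality constant 1 is an assumption)."""
    r2 = feature_space_variance(K)
    if r2 <= 0.0:
        raise ValueError("data have no spread in feature space")
    return 1.0 / r2
```

If the data are rescaled such that all kernel values are multiplied by 4 (that is, lengths in feature space double), the guess for C shrinks by a factor of 4, as the scaling argument requires.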


Having seen that different types of SVCs lead to similar performance, the question
arises as to how these performances compare with other approaches. Table 7.4
gives a summary of a number of results on the USPS set. Note that the best SVM
result is 3.0%; it uses additional techniques that we shall explain in Chapters 11
and 13. It is known that the USPS test set is rather difficult; the human error rate
is 2.5% [79]. For a discussion, see [496]. Note, moreover, that some of the results
reported in the literature for the USPS set were obtained with an enhanced training
set: For instance, the study of Drucker et al. [148] used an enlarged training set
of size 9709, containing some additional machine-printed digits, and found that
this improves the accuracy on the test set. Similarly, Bottou and Vapnik [65] used
a training set of size 9840. Since there are no machine-printed digits in the commonly
used test set (size 2007), this addition distorts the original learning problem
to a situation where results become somewhat hard to interpret. For our experiments,
we only had the original 7291 training examples at our disposal. Of all the
systems trained on this original set, the SVM system of Chapter 13 performs best.

Table 7.4 Summary of error rates on the USPS set. Note that two variants of this database
are used in the literature; one of them (denoted by USPS⁺) is enhanced by a set of machine-
printed characters which have been found to improve the test error. Note that the virtual
SV systems perform best out of all systems trained on the original USPS set.


7.8.2 Universality of the Support Vector Set 


In the present section, we report empirical evidence that the SV set contains all 
the information necessary to solve a given classification task: Using the Support 
Vector algorithm to train three different types of handwritten digit classifiers, we 
observe that these types of classifiers construct their decision surface from small, 
strongly overlapping subsets of the database. 

To study the Support Vector sets for three different types of SV classifiers, we use 
the optimal kernel parameters on the USPS set according to Table 7.3. Table 7.5 
shows that all three classifiers use around 250 Support Vectors per two-class- 
classifier (less than 4% of the training set), of which there are 10. The total number 
of different Support Vectors of the ten-class-classifiers is around 1600. It is less than 
2500 (10 times the above 250), since for instance a particular vector that has been
used as a positive SV (i.e., $y_i = +1$ in (7.25)) for digit 7, might at the same time be
a negative SV ($y_i = -1$) for digit 1.

Table 7.6 shows that the SV sets of the different classifiers have about 90% 
overlap. This surprising result has been reproduced on the MNIST OCR set [467]. 
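
The two quantities reported in this section, the number of distinct SVs of a ten-class classifier and the overlap between the SV sets of two classifiers, can be computed from index sets as in the following sketch. The function names are hypothetical, and SVs are assumed to be identified by their indices in the training set.

```python
def total_distinct_svs(binary_sv_sets):
    """Size of the union of the SV index sets of the two-class classifiers."""
    return len(set().union(*binary_sv_sets))


def overlap_percentage(sv_column, sv_row):
    """Percentage of the SV set `sv_column` that is contained in `sv_row`."""
    sv_column, sv_row = set(sv_column), set(sv_row)
    return 100.0 * len(sv_column & sv_row) / len(sv_column)
```

Note that the overlap measure is not symmetric: the percentage of one set contained in another generally differs from the percentage obtained with the roles exchanged, unless the two sets have the same size.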
Table 7.5 First row: Total number of different SVs in three different ten-class-classifiers
(i.e., number of elements of the union of the ten two-class-classifier SV sets), obtained by
choosing different functions k in (7.25) and (7.29); Second row: Average number of SVs per
two-class-classifier (USPS database size: 7291) (from [470]).


                 | Polynomial | RBF  | Sigmoid
total # of SVs   | 1677       | 1498 | 1611
average # of SVs | 24 é       | 235  | 254


Table 7.6 Percentage of the SV set of [column] contained in the SV set of [row]; for ten- 
class classifiers (top), and binary recognizers for digit class 7 (bottom) (USPS set) (from [470]). 




Using a leave-one-out procedure similar to Proposition 7.4, Vapnik and Watkins 
have put forward a theoretical argument for shared SVs. We state it in the follow- 
ing form: If the SV set of three SV classifiers had no overlap, we could obtain a 
fourth classifier which has zero test error. 

To see why this is the case, note that if a pattern is left out of the training set, 
it will always be classified correctly by voting between the three SV classifiers 
trained on the remaining examples: Otherwise, it would have been an SV of at least 
two of them, if kept in the training set. The expectation of the number of patterns 
which are SVs of at least two of the three classifiers, divided by the training set 
size, thus forms an upper bound on the expected test error of the voting system. 
Regarding error rates, it would thus in fact be desirable to be able to construct 
classifiers with different SV sets. An alternative explanation, studying the effect 
of the input density on the kernel, was recently proposed by Williams [597]. 
Finally, we add that the result is also plausible in view of the similar regularization 
characteristics of the different kernels that were used (see Chapter 4). 
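
The quantity entering the above bound, namely the fraction of training patterns that are SVs of at least two of the three classifiers, is easily computed once the SV sets are known. The sketch below assumes that SVs are identified by their training set indices; the function name is hypothetical.

```python
from collections import Counter


def shared_sv_fraction(sv_sets, m):
    """Fraction of the m training patterns that are SVs of at least two classifiers."""
    # Count, for each pattern index, in how many SV sets it occurs.
    counts = Counter(i for s in sv_sets for i in set(s))
    shared = sum(1 for c in counts.values() if c >= 2)
    return shared / m
```

For SV sets without any overlap, this fraction is zero, which corresponds to the statement that the voting classifier would then have zero expected test error.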

As described in Section 7.3, the Support Vector set contains all the information a
given classifier needs for constructing the decision function. Due to the overlap in
the Support Vector sets of different classifiers, we can even train classifiers on the
Support Vector set of another classifier, the latter having a different kernel from the
former. Table 7.7 shows that this leads to results comparable to those after training
on the whole database. In Section 11.3, we will use this finding as a motivation for
a method to make SVMs transformation invariant, to obtain virtual SV machines.

Table 7.7 Training classifiers on the Support Vector sets of other classifiers leads to
performances on the test set (USPS problem) which are as good as the results for training on the
full database (numbers of errors on the 2007-element test set are shown, for two-class
classifiers separating digit 7 from the rest). Additionally, the results for training on a random
subset of the database of size 200 are displayed.

What do these results concerning the nature of Support Vectors tell us? Learn- 
ing can be viewed as inferring regularities from a set of training examples. Much 
research has been devoted to the study of various learning algorithms, which al- 
low the extraction of these underlying regularities. No matter how different the 
outward appearance of these algorithms is, they must all rely on intrinsic regular- 
ities of the data. If the learning has been successful, these intrinsic regularities are 
captured in the values of certain parameters of a learning machine; for a polyno- 
mial classifier, these parameters are the coefficients of a polynomial, for a neural 
network, they are weights, biases, and gains, and for a radial basis function classi- 
fier, they are weights, centers, and widths. This variety of different representations 
of the intrinsic regularities, however, conceals the fact that they all stem from a 
common root. This is why SVMs with different kernel functions identify the same 
subset of the training examples as crucial for the regularity to be learned. 


7.8.3 Other Applications 


SVMs have been successfully applied in other computer vision tasks, which relate 
to the OCR problems discussed above. Examples include object and face detection 
and recognition, as well as image retrieval [57, 467, 399, 419, 237, 438, 99, 75]. 

Another area where SVMs have been used with success is that of text categorization.
Being a high-dimensional problem, text categorization has been found to
be well suited for SVMs. A popular benchmark is the Reuters-22173 text corpus.
The news agency Reuters collected 21450 news stories from 1987, and partitioned
and indexed them into 135 different categories. The features typically used to classify
Reuters documents are $10^4$-dimensional vectors containing word frequencies
within a document (sometimes called the “bag-of-words” representation of texts,
as it completely discards the information on word ordering). Using this coding,
SVMs have led to excellent results, see [155, 265, 267, 150, 333, 542, 149, 326].
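
A minimal sketch of the bag-of-words coding is given below. The whitespace tokenization, the lowercasing, and the use of raw counts (rather than normalized frequencies) are simplifying assumptions of this sketch; the cited studies use more elaborate preprocessing.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    # One coordinate per vocabulary entry; words outside the vocabulary are ignored.
    return [counts[word] for word in vocabulary]
```

Two documents that contain the same words in a different order are mapped to the same vector, which is precisely the sense in which the representation discards word ordering.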

Since the use of classification techniques is ubiquitous throughout technology,
we cannot give an exhaustive listing of all successful SVM applications. We thus
conclude the list with some of the more exotic applications, such as in high-energy
physics [19, 558], in the monitoring of household appliances [390], in
protein secondary structure prediction [249], and, with rather intriguing results,
in the design of decision feedback equalizers (DFE) in telephony [105].


This chapter introduced SV pattern recognition algorithms. The crucial idea is 
to use kernels to reduce a complex classification task to one that can be solved 
with separating hyperplanes. We discussed what kind of hyperplane should be 
constructed in order to get good generalization performance, leading to the idea 
of large margins. It turns out that the concept of large margins can be justified 
in a number of different ways, including arguments based on statistical learning 
theory, and compression schemes. We described in detail how the optimal margin 
hyperplane can be obtained as the solution of a quadratic programming problem. 
We started with the linear case, where the hyperplane is constructed in the space 
of the inputs, and then moved on to the case where we use a kernel function to 
compute dot products, in order to compute the hyperplane in a feature space. 

Two further extensions greatly increase the applicability of the approach. First, 
to deal with noisy data, we introduced so-called slack variables in the optimization 
problem. Second, for problems that have more than just two classes, we described 
a number of generalizations of the binary SV classifiers described initially. 

Finally, we reported applications and benchmark comparisons for the widely 
used USPS handwritten digit task. SVMs turn out to work very well in this field, 
as well as in a variety of other domains mentioned briefly. 


7.1 (Weight Vector Scaling •) Show that instead of the “1” on the right hand side of the
separation constraint (7.11), we can use any positive number γ > 0, without changing the
optimal margin hyperplane solution. What changes in the soft margin case?


7.2 (Dual Perceptron Algorithm [175] ••) Kernelize the perceptron algorithm described
in footnote 1. Which of the patterns will appear in the expansion of the solution? 


7.3 (Margin of Optimal Margin Hyperplanes [62] ••) Prove that the geometric margin
ρ of the optimal margin hyperplane can be computed from the solution α via

\[ \rho^{-2} = \sum_{i=1}^{m} \alpha_i. \tag{7.68} \]

Also prove that

\[ \rho^{-2} = 2 W(\alpha) = \|w\|^2. \tag{7.69} \]

Note that for these relations to hold true, α needs to be the solution of (7.29).


7.4 (Relationship Between $\|w\|$ and the Geometrical Margin •) (i) Consider a separating
hyperplane in canonical form. Prove that the margin, measured perpendicularly to
the hyperplane, equals $1/\|w\|$, by considering two opposite points which precisely satisfy
$|\langle w, x_i \rangle + b| = 1$.

(ii) How does the corresponding statement look for the case of ν-SVC? Use the constraint
(7.41), and assume that all slack variables are 0.


7.5 (Compression Bound for Large Margin Classification ◦◦◦) Formalize the ideas
stated in Section 7.2: Assuming that the data are separable and lie in a ball of radius R,
how many bits are necessary to encode the labels of the data by encoding the parameters
of a hyperplane? Formulate a generalization error bound in terms of the compression ratio
by using the analysis of Vapnik [561, Section 4.6]. Compare the resulting bound with
Theorem 7.3. Take into account the eigenvalues of the Gram matrix, using the ideas
from [604] (cf. Section 12.4).


7.6 (Positive Definiteness of the SVC Hessian •) From Definition 2.4, prove that the
matrix $Q_{ij} := (y_i y_j k(x_i, x_j))_{ij}$ is positive definite.


7.7 (Geometric Interpretation of Duality in SVC [38] ••) Prove that the programming
problem (7.10), (7.11) has the same solution as (7.22), provided the threshold b is
adjusted such that the hyperplane bisects the shortest connection of the two convex hulls. 
Hint: Show that the latter is the dual of the former. Interpret the result geometrically. 


7.8 (Number of Points Required to Define a Hyperplane •) From (7.22), argue that
no matter what the dimensionality of the space, there can always be situations where two 
training points suffice to determine the optimal hyperplane. 


7.9 (In-Bound SVs in Soft Margin SVMs •) Prove that in-bound SVs lie exactly on
the margin. Hint: Use the KKT conditions, and proceed analogously to Section 7.3, where 
it was shown that in the hard margin case, all SVs lie exactly on the margin. 

Argue, moreover, that bound SVs can lie both on or in the margin, and that they will 
“usually” lie in the margin. 


7.10 (Pattern-Dependent Regularization •) Derive a version of the soft margin classification
algorithm which uses different regularization constants $C_i$ for each training example.
Start from (7.35), replace the second term by $\frac{1}{m} \sum_{i=1}^{m} C_i \xi_i$, and derive the dual. Discuss
both the mathematical form of the result, and possible applications (cf. [462]).


7.11 (Uncertain Labels ••) In this chapter, we have been concerned mainly with the case
where the patterns are assigned to one of two classes, i.e., $y \in \{\pm 1\}$. Consider now the
case where the assignment is not strict, i.e., $y \in [-1, 1]$. Modify the soft margin variants
of the SV algorithm, (7.34), (7.35) and (7.41), (7.40), such that

■ whenever y = 0, the corresponding pattern has effectively no influence

■ if all labels are in {±1}, the original algorithm is recovered

■ if |y| < 1, then the corresponding pattern has less influence than it would have for
|y| = 1.


7.12 (SVMs vs. Parzen Windows ◦◦◦) Develop algorithms that approximate the SVM
(soft or hard margin) solution by starting from the Parzen Windows algorithm (Figure 1.1) 
and sparsifying the expansion of the solution. 


7.13 (Squared Error SVC [111] ••) Derive a version of the soft margin classification
algorithm which penalizes the errors quadratically. Start from (7.35), replace the second
term by $\frac{1}{m} \sum_{i} \xi_i^2$, and derive the dual. Compare the result to the usual C-SVM, both in
terms of algorithmic differences and in terms of robustness properties. Which algorithm
would you expect to work better for Gaussian-like noise, which one for noise with longer
tails (and thus more outliers) (cf. Chapter 3)?


7.14 (C-SVC with Group Error Penalty ••) Suppose the training data are partitioned
into L groups,

\[
\begin{aligned}
& (x_1^1, y_1^1), \ldots, (x_1^{m_1}, y_1^{m_1}) \\
& \qquad \vdots \\
& (x_L^1, y_L^1), \ldots, (x_L^{m_L}, y_L^{m_L}),
\end{aligned}
\tag{7.70}
\]

where $x_i^j \in \mathcal{H}$ and $y_i^j \in \{\pm 1\}$ (it is understood that the index i runs over {1, 2, ..., L} and
the index j runs over {1, 2, ..., $m_i$}).

Suppose, moreover, that we would like to count a point as misclassified already if one
point belonging to the same group is misclassified.

Design an SV algorithm where each group’s penalty equals the slack of the worst point
in that group.

Hint: Use the objective function

\[ \frac{1}{2} \|w\|^2 + \sum_{i=1}^{L} C_i \xi_i \tag{7.71} \]

and the constraints

\[ y_i^j \left( \langle w, x_i^j \rangle + b \right) \ge 1 - \xi_i, \tag{7.72} \]
\[ \xi_i \ge 0. \tag{7.73} \]

Show that the dual problem consists of maximizing

\[ W(\alpha) = \sum_{i,j} \alpha_i^j - \frac{1}{2} \sum_{i,j,i',j'} \alpha_i^j \alpha_{i'}^{j'} y_i^j y_{i'}^{j'} \langle x_i^j, x_{i'}^{j'} \rangle, \tag{7.74} \]

subject to

\[ 0 = \sum_{i,j} \alpha_i^j y_i^j, \quad 0 \le \alpha_i^j, \quad \text{and} \quad \sum_{j} \alpha_i^j \le C_i. \tag{7.75} \]

Argue that typically, only one point per group will become an SV.
Show that C-SVC is a special case of this algorithm.


7.15 (ν-SVC with Group Error Penalty •••) Derive a ν-version of the algorithm in
Problem 7.14.


7.16 (C-SVC vs. ν-SVC ••) As a modification of ν-SVC (Section 7.5), compute the dual
of $\tau(w, \xi, \rho) = \|w\|^2/2 + C(-\nu\rho + (1/m) \sum_{i=1}^{m} \xi_i)$ (note that in ν-SVC, C = 1 is used).
Argue that due to the homogeneity of the objective function, the dual solution gets scaled
by C; however, the decision function will not change. Hence we may set C = 1.


7.17 (Multi-class vs. Binary SVC [593] ••) (i) Prove that the multi-class SVC formulation
of (7.59) specializes to the binary C-SVC (7.35) in the case k = 2, by using
$w_1 = -w_2$, $b_1 = -b_2$, and $\xi_i = \frac{1}{2} \xi_i^r$ for pattern $x_i$ in class r. (ii) Derive the dual of (7.59).


7.18 (Multi-Class ν-SVC ◦◦◦) Derive a ν-version of the approach described in
Section 7.6.4.


7.19 (LPM with Constrained Signs •) Modify the LPM algorithm such that it is guaranteed
that each expansion pattern will have a coefficient $v_i$ whose sign equals the class
label $y_i$. Hint: Do not introduce additional constraints, but eliminate the $\alpha_i^*$ variables and
use a different ansatz for the solution.


7.20 (Multi-Class LPM [593] ••) In analogy to Section 7.6.4, develop a multi-class
version of the LP machine (Section 7.7).


7.21 (Version Space [368, 239, 451, 238] •••) Consider hyperplanes passing through
the origin, $\{x \mid \langle w, x \rangle = 0\}$, with weight vectors $w \in \mathcal{H}$, $\|w\| = 1$. The set of all such
hyperplanes forms a unit sphere in weight space. Each training example $(x, y) \in \mathcal{H} \times \{\pm 1\}$
splits the sphere into two halves: one that correctly classifies (x, y), i.e., $\operatorname{sgn} \langle w, x \rangle = y$,
and one that does not. Each training example thus corresponds to a hemisphere (or,
equivalently, an oriented great circle) in weight space, and a training set $(x_1, y_1), \ldots, (x_m, y_m)$
corresponds to the intersection of m hemispheres, called the version space.


1. Discuss how the distances between the training example and the hyperplane in the two 
representations are related. 

2. Discuss the relationship to the idea of the Hough transform [255]. The Hough transform 
is sometimes used in image processing to detect lines. In a nutshell, each point gets to cast 
votes in support for all potential lines that are consistent with it, and at the end, the lines 
can be read off the histogram of votes. 

3. Prove that if all $x_i$ have the same norm, the maximum margin weight vector corresponds
to the center of the largest $(m - 1)$-dimensional sphere that fits into version space.

4. Construct situations where the center of the above largest sphere will generalize poorly,
and compare it to the center of mass of version space, called the Bayes point.

5. If you disregard the labels of the training examples, there is no longer a single area on the
unit sphere which is distinguished from the others due to its corresponding to the correct
labelling. Instead, the sphere is split into a number of cells. Argue that the expectation of
the natural logarithm of this number equals the VC entropy (Section 5.5.6).


7.22 (Kernels on Sets ◦◦◦) Use the construction of Proposition 2.19 to define a kernel
that compares two points $x, x' \in \mathcal{H}$ by comparing the version spaces (see Problem 7.21)
of the labelled examples (x, 1) and (x', 1). Define a prior distribution P on the unit sphere
in $\mathcal{H}$, and discuss the implications of its choice for the induced kernel. What can you say
about the connection between this kernel and the kernel $\langle x, x' \rangle$?


7.23 (Training Algorithms for ν-SVC ◦◦◦) Try to come up with efficient training
algorithms for ν-SVC, building on the material presented in Chapter 10.

(i) Design a simple chunking algorithm that gradually removes all non-SVs.

(ii) Design a decomposition algorithm.

(iii) Is it possible to modify the SMO algorithm such that it deals with the additional
equality constraint that ν-SVC comes with? What is the smallest set of patterns that
you can optimize over without violating the two equality constraints? Can you design
a generalized SMO algorithm for this case?


7.24 (Prior Class Probabilities ••) Suppose that it is known a priori that $\pi_+$ and $\pi_-$
are the probabilities that a pattern belongs to the class ±1, respectively. Discuss ways
of modifying the simple classification algorithm described in Section 1.2 to take this
information into account.


7.25 (Choosing C ••) Suppose that R is a measure for the range of the data in feature
space that scales like the length of the points in $\mathcal{H}$ (cf. Section 7.8.1). Argue that C should
scale like $1/R^2$.¹⁵ Hint: consider scaling the data by some γ > 0. How do you have to
scale C such that $f(x_j) = \langle w, \Phi(x_j) \rangle + b$ (where $w = \sum_i \alpha_i y_i \Phi(x_i)$) remains invariant
($j \in [m]$)?¹⁶ Discuss measures R that can be used. Why does $R := \max_i k(x_i, x_i)$ not
make sense for the Gaussian RBF kernel?

Moreover, argue that in the asymptotic regime, the upper bound on the $\alpha_j$ should scale
with 1/m, justifying the use of m in (7.38).


7.26 (Choosing C, Part II ◦◦◦) Problem 7.25 does not take into account the class labels,
and hence also not the potential overlap of the two classes. Note that this is different in
the ν-approach, which automatically scales the margin with the noise. Can you modify the
recommendation in Problem 7.25 to get a selection criterion for C which takes into account
the labels, e.g., in the form of prior information on the noise level?


15. Thanks to Olivier Chapelle for this suggestion. 
16. Note that in the ν-parametrization, this scale invariance comes for free.
This chapter describes an SV approach to the problem of novelty detection and
high-dimensional quantile estimation [475]. This is an unsupervised problem, which 
can be described as follows. Suppose we are given some dataset drawn from an 
underlying probability distribution P, and we want to estimate a “simple” subset S 
of input space, such that the probability that a test point drawn from P lies outside 
of S equals some a priori specified value between 0 and 1. 

We approach the problem by trying to estimate a function f which is positive 
on S and negative on the complement. The functional form of f is given by a ker- 
nel expansion in terms of SVs; it is regularized by controlling the length of the 
weight vector in an associated feature space (or, equivalently, by maximizing a 
margin). The expansion coefficients are found by solving a quadratic program- 
ming problem, which can be done by carrying out sequential optimization over 
pairs of input patterns. We also state theoretical results concerning the statistical 
performance. The algorithm is a natural extension of the Support Vector classifi- 
cation algorithm, as described in the previous chapter, to the case of unlabelled 
data. 

The chapter is organized as follows. After a review of some previous work in 
Section 8.2, taken from [475], we describe SV algorithms for single class problems. 
Section 8.4 gives details of the implementation of the optimization procedure, 
specifically for the case of single-class SVMs. Following this, we report theoretical 
results characterizing the present approach (Section 8.5). In Section 8.7, we deal 
with the application of the algorithm to artificial and real-world data. 

The prerequisites of this chapter are almost identical to those of the previous one. 
Readers who have worked through Chapter 7 should be fine with the current chapter. 
Section 8.2 requires some knowledge of probability theory, as explained in Section B.1; 
readers who are only interested in the algorithms, however, can skip this slightly 
more technical section. Likewise, some technical parts of Section 8.5 benefit from 
knowledge of Chapter 5, but these can be skipped if desired. 

[Chapter roadmap diagram: 8.4 Implementation; 8.5, 8.6 Theoretical Analysis; 8.7 Experiments] 

There have been a number of attempts to transfer the idea of using kernels to com- 
pute dot products in feature spaces to the domain of unsupervised learning. The 
problems in this domain are, however, less precisely specified. Generally, they can 
be characterized as estimating functions of the data which tell you something inter- 
esting about the underlying distributions. For instance, kernel PCA (Chapter 14) 
can be described as computing functions which on the training data produce unit 
variance outputs while having minimum norm in feature space. Another kernel- 
based unsupervised learning technique, regularized principal manifolds (Chap- 
ter 17), computes functions which give a mapping onto a lower-dimensional man- 
ifold minimizing a regularized quantization error. Clustering algorithms are fur- 
ther examples of unsupervised learning techniques which can be kernelized [480]. 

An extreme point of view is that unsupervised learning is about estimating the 
density of the distribution P generating the data. Knowledge of the density would 
then allow us to solve whatever problem can be solved on the basis of data sampled 
from that density. 

The present chapter addresses an easier problem: it proposes an algorithm 
that computes a binary function which is supposed to capture regions in input 
space where the probability density is in some sense large (its support, or, more 
generally, quantiles); that is, a function which is nonzero in a region where most 
of the data are located. In doing so, this method is in line with Vapnik’s principle 
never to solve a problem which is more general than the one we actually need 
to solve [561]. Moreover, it is also applicable in cases where the density of the 
data’s distribution is not even well-defined, as can be the case if the distribution 
has singular components. 

In order to describe previous work, it is convenient to introduce the following 
definition of a (multi-dimensional) quantile function [158]. Let $x_1, \ldots, x_m$ be an iid 
sample of a random experiment in a set $\mathcal{X}$ with distribution $P$. Let $\mathcal{C}$ be a class 
of measurable subsets of $\mathcal{X}$ and let $\lambda$ be a real-valued function defined on $\mathcal{C}$. The 
quantile function with respect to $(P, \lambda, \mathcal{C})$ is 

$$U(\mu) = \inf\{\lambda(C) \mid P(C) \ge \mu,\ C \in \mathcal{C}\}, \quad 0 < \mu \le 1. \tag{8.1}$$

Loosely speaking, the quantile function measures how large a set one needs in 
order to capture a certain amount of probability mass of $P$. 
An interesting special case is the empirical quantile function, where $P$ is the 
empirical distribution 

$$P_{\text{emp}}(C) = \frac{1}{m} \sum_{i=1}^{m} I_C(x_i), \tag{8.2}$$

which is the fraction of observations that fall into $C$. 

We denote by $C_\lambda(\mu)$ and $C_\lambda^m(\mu)$ the (not necessarily unique) $C \in \mathcal{C}$ that attain the 
infimum (when it is achievable) for $P$ and $P_{\text{emp}}$, respectively. Intuitively speaking, 
these are the smallest sets (where size is measured by $\lambda$) which contain a probability mass $\mu$. 
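
As a concrete illustration of the empirical quantile function, the following minimal sketch assumes a deliberately simple setting that is not taken from the text: the class of sets is the closed intervals on the real line, and the size function is the interval length. The function name `min_length_interval` and the sample values are hypothetical.

```python
import math


def min_length_interval(xs, mu):
    """Return the shortest closed interval [a, b] whose empirical mass is >= mu.

    Assumptions of this sketch: the class of sets is the closed intervals on
    the real line and the size function lambda is the interval length.
    """
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")
    pts = sorted(xs)
    m = len(pts)
    # Smallest number of sample points that gives P_emp >= mu
    # (the small tolerance guards against floating-point round-off).
    k = max(1, math.ceil(mu * m - 1e-12))
    best = (pts[0], pts[k - 1])
    # An optimal interval can always be chosen with sample points as endpoints,
    # so it suffices to slide a window of k consecutive sorted points.
    for i in range(1, m - k + 1):
        a, b = pts[i], pts[i + k - 1]
        if b - a < best[1] - best[0]:
            best = (a, b)
    return best


sample = [0.0, 0.1, 0.2, 0.3, 5.0]
a, b = min_length_interval(sample, 0.8)
print(a, b, b - a)  # the empirical quantile function at mu = 0.8 is the length b - a
```

In this toy sample the isolated point at 5.0 is left out for a mass of 0.8, whereas a mass of 1 forces the interval to stretch across the whole sample.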

The most common choice of $\lambda$ is Lebesgue measure (loosely speaking, the volume 
of the set $C$), in which case $C_\lambda(\mu)$ is the minimum volume set $C \in \mathcal{C}$ that contains at 
least a fraction $\mu$ of the probability mass. Estimators of the form $C_\lambda^m(\mu)$ are called 
minimum volume estimators. Of course, it is not sufficient that the estimated set 
have a small volume and contain a fraction $\mu$ of the training examples. In machine 
learning applications, we want to find a set that contains a fraction of test examples 
that is close to $\mu$. This is where the complexity trade-off enters (see Figure 8.1), 
as with the methodology that we have already described in a number of learning 
scenarios. On the one hand, we want to use a large class $\mathcal{C}$, to ensure that it contains 
sets $C$ which are very small yet can contain a fraction $\mu$ of training examples. 
On the other hand, if we allowed just any set, the chosen set $C$ could consist of 
only the training points (we would then “memorize” the training points), and it 
would generalize poorly to test examples; in other words, it would not contain a 
large probability mass $P(C)$. Therefore, we have to consider classes of sets which 
are suitably restricted. As we will see below, this can be achieved using an SVM 
regularizer. 

Observe that for $\mathcal{C}$ being all measurable sets, and $\lambda$ being the Lebesgue measure, 
$C_\lambda(1)$ is the support of the density $p$ corresponding to $P$, assuming it exists (note 
that $C_\lambda(1)$ is well defined even when $p$ does not exist). For smaller classes $\mathcal{C}$, $C_\lambda(1)$ 
is the minimum volume $C \in \mathcal{C}$ containing the support of $p$. In the case where 
$\mu < 1$, it seems the first work was reported in [454, 229], in which $\mathcal{X} = \mathbb{R}^2$, with $\mathcal{C}$ 
being the class of closed convex sets in $\mathcal{X}$ (they actually considered density contour 
clusters; cf. [475] for a definition). Nolan [385] considered higher dimensions, with 
$\mathcal{C}$ being the class of ellipsoids. Tsybakov [550] studied an estimator based on piecewise 
polynomial approximation of $C_\lambda(\mu)$ and showed it attains the asymptotically 
minimax rate for certain classes of densities. Polonik [417] studied the estimation 
of $C_\lambda(\mu)$ by $C_\lambda^m(\mu)$. He derived asymptotic rates of convergence in terms of various 
measures of richness of $\mathcal{C}$. More information on minimum volume estimators can 
be found in that work, and in [475]. 

Figure 8.1 A single-class toy problem, with two different solutions. The left graph depicts 
a rather complex solution, which captures all training points (thus, $P_{\text{emp}}$ of the estimated 
region equals 1, cf. (8.2)), while having small volume in $\mathbb{R}^2$. On the right, we show a solution 
which misses one training point (it does not capture all of $P_{\text{emp}}$), but since it is “simpler,” 
it is conceivable that it will nevertheless capture more of the true underlying distribution $P$ 
that is assumed to have generated the data. In the present context, a function $\lambda$ is used to 
measure the simplicity of the estimated region. In the algorithm described below, $\lambda$ is a SV 
style regularizer. 

Let us conclude this section with a short discussion of how the present work 
relates to the above. The present chapter describes an algorithm which finds 
regions close to $C_\lambda(\mu)$. Our class $\mathcal{C}$ is defined implicitly via a kernel $k$ as the set 
of half-spaces in a SV feature space. We do not try to minimize the volume of 
$C$ in input space. Instead, we minimize a SV style regularizer which, using a 
kernel, controls the smoothness of the estimated function describing $C$. In terms of 
multi-dimensional quantiles, the present approach employs $\lambda(C_w) = \|w\|^2$, where 
$C_w = \{x \mid f_w(x) \ge \rho\}$, and $(w, \rho)$ are respectively a weight vector and an offset 
parametrizing a hyperplane in the feature space associated with the kernel. 


8.3 Algorithms 


We consider unlabelled training data 

$$X = \{x_1, \ldots, x_m\} \subset \mathcal{X}, \tag{8.3}$$

where $m \in \mathbb{N}$ is the number of observations, and $\mathcal{X}$ is some set. For simplicity, we 
think of it as a compact subset of $\mathbb{R}^N$. Let $\Phi$ be a feature map $\mathcal{X} \to \mathcal{H}$; in other 
words, a map into a dot product space $\mathcal{H}$ such that the dot product in the image 
of $\Phi$ can be computed by evaluating some simple kernel (Chapters 2 and 4), 

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle, \tag{8.4}$$

such as the Gaussian, 

$$k(x, x') = e^{-\|x - x'\|^2 / c}. \tag{8.5}$$

Indices $i$ and $j$ are understood to range over $1, \ldots, m$ (in compact notation: $i, j \in [m]$). 
Bold face Greek letters denote $m$-dimensional vectors whose components are 
labelled using normal face type. 
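
The Gaussian kernel and the matrix of pairwise kernel values can be written down in a few lines. The sketch below is only an illustration; the helper names `gaussian_kernel` and `gram_matrix` and the sample points are assumptions made here, not notation from the text.

```python
import math


def gaussian_kernel(x, y, c):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / c), cf. (8.5)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / c)


def gram_matrix(points, c):
    """Matrix of all pairwise kernel values K[i][j] = k(x_i, x_j)."""
    return [[gaussian_kernel(p, q, c) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical sample
K = gram_matrix(points, c=0.5)
for row in K:
    print(["%.4f" % v for v in row])
```

The diagonal entries equal 1 for any width $c$, a fact that the chapter uses repeatedly for RBF kernels.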

In the remainder of this section, we shall describe an algorithm which returns 
a function $f$ that takes the value $+1$ in a “small” region capturing most of the 
data points, and $-1$ elsewhere. The strategy, inspired by the previous chapter, is 
to map the data into the feature space corresponding to the kernel, and to separate 
them from the origin with maximum margin. For a new point x, the value f(x) 
is determined by evaluating which side of the hyperplane it falls on, in feature 
space. Due to the freedom to utilize different types of kernel functions, this simple 
geometric picture corresponds to a variety of nonlinear estimators in input space. 

To separate the data set from the origin, we solve the following quadratic 
program: 

$$\underset{w \in \mathcal{H},\ \boldsymbol{\xi} \in \mathbb{R}^m,\ \rho \in \mathbb{R}}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i \xi_i - \rho \tag{8.6}$$

$$\text{subject to} \quad \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0. \tag{8.7}$$

Here, $\nu \in (0, 1]$ is a parameter which is introduced in close analogy to the ν-SV 
classification algorithm detailed in the previous chapter. Its meaning will become 
clear later. 

Since nonzero slack variables $\xi_i$ are penalized in the objective function, we can 
expect that if $w$ and $\rho$ solve this problem, then the decision function, 

$$f(x) = \operatorname{sgn}(\langle w, \Phi(x) \rangle - \rho), \tag{8.8}$$

will equal 1 for most examples $x_i$ contained in the training set,¹ while the 
regularization term $\|w\|$ will still be small. For an illustration, see Figure 8.2. As in ν-SVC 
(Section 7.5), the trade-off between these two goals is controlled by a parameter $\nu$. 
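
To make the roles of $w$, $\rho$, and the slack variables concrete, the following sketch evaluates the primal objective and the decision function for a fixed, hand-picked hyperplane. It assumes the identity feature map (a linear kernel), so that $w$ lives in input space; the data, the values of `w` and `rho`, and the function names are hypothetical, and no optimization is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgn(z):
    """Sign convention of the text: +1 for z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1


def primal_objective(w, rho, points, nu):
    """Value of (8.6), using the smallest slacks that satisfy (8.7).

    Assumption of this sketch: the feature map is the identity (linear kernel).
    """
    m = len(points)
    slacks = [max(0.0, rho - dot(w, x)) for x in points]
    value = 0.5 * dot(w, w) + sum(slacks) / (nu * m) - rho
    return value, slacks


def decide(w, rho, x):
    """Decision function (8.8)."""
    return sgn(dot(w, x) - rho)


# Three points on the hyperplane <w, x> = rho and one point close to the origin.
points = [(2.0, 2.0), (2.5, 1.5), (1.5, 2.5), (0.2, 0.1)]
w, rho = (0.5, 0.5), 2.0
value, slacks = primal_objective(w, rho, points, nu=0.5)
print(value, slacks, [decide(w, rho, x) for x in points])
```

Only the point near the origin receives a nonzero slack and the label $-1$; lowering $\rho$ far enough would remove that slack at the price of a smaller $\rho$ term in the objective.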
Using multipliers $\alpha_i, \beta_i \ge 0$, we introduce a Lagrangian, 

$$L(w, \boldsymbol{\xi}, \rho, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho - \sum_i \alpha_i \left(\langle w, \Phi(x_i) \rangle - \rho + \xi_i\right) - \sum_i \beta_i \xi_i, \tag{8.9}$$

and set the derivatives with respect to the primal variables $w$, $\boldsymbol{\xi}$, $\rho$ equal to zero, 
yielding 

$$w = \sum_i \alpha_i \Phi(x_i), \tag{8.10}$$

$$\alpha_i = \frac{1}{\nu m} - \beta_i \le \frac{1}{\nu m}, \quad \sum_i \alpha_i = 1. \tag{8.11}$$

1. We use the convention that sgn(z) equals 1 for $z \ge 0$ and $-1$ otherwise. 

Figure 8.2 In the 2-D toy example depicted, the hyperplane $\langle w, \Phi(x) \rangle = \rho$ (with normal 
vector $w$ and offset $\rho$) separates all but one of the points from the origin. The outlier $\Phi(x)$ is 
associated with a slack variable $\xi$, which is penalized in the objective function (8.6). The 
distance from the outlier to the hyperplane is $\xi/\|w\|$; the distance between hyperplane 
and origin is $\rho/\|w\|$. The latter implies that a small $\|w\|$ corresponds to a large margin 
of separation from the origin. 


Eq. (8.10) is the familiar Support Vector expansion (cf. (7.15)). Together with (8.4), 
it transforms the decision function (8.8) into a kernel expansion, 

$$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big). \tag{8.12}$$

Substituting (8.10)-(8.11) into $L$ (8.9), and using (8.4), we obtain the dual problem: 

$$\underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{minimize}} \quad \frac{1}{2}\sum_{ij} \alpha_i \alpha_j k(x_i, x_j) \tag{8.13}$$

$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu m}, \tag{8.14}$$

$$\sum_i \alpha_i = 1. \tag{8.15}$$

We can show that at the optimum, the two inequality constraints (8.7) become 
equalities if $\alpha_i$ and $\beta_i$ are nonzero, which implies $0 < \alpha_i < 1/(\nu m)$ (KKT conditions). 
Therefore, we can recover $\rho$ by exploiting that for any such $\alpha_i$, the corresponding 
pattern $x_i$ satisfies 

$$\rho = \langle w, \Phi(x_i) \rangle = \sum_j \alpha_j k(x_j, x_i). \tag{8.16}$$


Note that if $\nu$ approaches 0, the upper bounds on the Lagrange multipliers tend to 
infinity and the second inequality constraint in (8.14) becomes void. We then have 
a hard margin problem, since the penalization of errors becomes infinite, as can be 
seen from the primal objective function (8.6). The problem is still feasible, since we 
have placed no restriction on the offset $\rho$, so $\rho$ can become a large negative number 
in order to satisfy (8.7). 
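
The recovery of the offset from a dual solution, as in (8.16), can be sketched as follows. The function name `recover_rho`, the tolerance, and the choice to average over all coefficients strictly between their bounds are assumptions of this sketch; the text only requires a single such coefficient.

```python
def recover_rho(alpha, K, nu, tol=1e-9):
    """Offset rho from (8.16), averaged over the non-bound SVs.

    alpha: dual coefficients, K: Gram matrix, nu: parameter in (0, 1].
    Averaging (rather than using a single index) is a choice made here for
    numerical robustness; it is not prescribed by the text.
    """
    m = len(alpha)
    upper = 1.0 / (nu * m)
    outputs = [sum(alpha[j] * K[j][i] for j in range(m)) for i in range(m)]
    free = [i for i in range(m) if tol < alpha[i] < upper - tol]
    if not free:
        raise ValueError("no coefficient lies strictly between its bounds")
    return sum(outputs[i] for i in free) / len(free)


# Toy case: two orthonormal feature vectors and nu = 0.5, so the upper bound is 1.
# By symmetry the dual optimum of (8.13)-(8.15) is alpha = (0.5, 0.5).
K = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.5, 0.5]
print(recover_rho(alpha, K, nu=0.5))  # 0.5
```

In the toy case, $w = 0.5\,\Phi(x_1) + 0.5\,\Phi(x_2)$, so both patterns have output 0.5 and lie exactly on the hyperplane.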

It is instructive to compare (8.13)-(8.15) to a Parzen windows estimator (cf. 
page 6). To this end, suppose we use a kernel which can be normalized as a 
density in input space, such as the Gaussian (8.5). If we use $\nu = 1$ in (8.14), 
then the two constraints only allow the solution $\alpha_1 = \ldots = \alpha_m = 1/m$. Thus the 
kernel expansion in (8.12) reduces to a Parzen windows estimate of the underlying 
density. For $\nu < 1$, the equality constraint (8.15) still ensures that the decision 
function is a thresholded density; in that case, however, the density will only 
be represented by a subset of training examples (the SVs), namely those which are 
important for the decision (8.12) to be taken. Section 8.5 will explain the precise 
meaning of the parameter $\nu$. 
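
The connection to Parzen windows for $\nu = 1$ can be checked numerically. The sketch below works in one dimension with the Gaussian (8.5); the sample, the evaluation points, and the function names are hypothetical. It compares the kernel expansion with uniform coefficients against a normalized Parzen windows density, which differ only by the constant $\sqrt{\pi c}$.

```python
import math


def gaussian_kernel(x, y, c):
    return math.exp(-((x - y) ** 2) / c)


def kernel_expansion(x, sample, alpha, c):
    """Argument of the sgn in (8.12), without the offset rho."""
    return sum(a * gaussian_kernel(xi, x, c) for a, xi in zip(alpha, sample))


def parzen_density(x, sample, c):
    """1-D Parzen windows estimate with the Gaussian (8.5), normalized to integrate to 1."""
    norm = math.sqrt(math.pi * c)  # integral of exp(-t^2 / c) over the real line
    return sum(gaussian_kernel(xi, x, c) for xi in sample) / (len(sample) * norm)


sample = [-1.0, -0.5, 0.0, 0.4, 2.0]
c = 0.5
m = len(sample)
alpha = [1.0 / m] * m  # the only feasible point of (8.14), (8.15) when nu = 1

for x in (-0.7, 0.1, 1.5):
    lhs = kernel_expansion(x, sample, alpha, c)
    rhs = math.sqrt(math.pi * c) * parzen_density(x, sample, c)
    print(x, lhs, rhs)
```

The two columns agree up to round-off, so thresholding the kernel expansion at $\rho$ amounts to thresholding the density estimate at $\rho / \sqrt{\pi c}$ in this one-dimensional setting.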

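To conclude this section, we note that balls can also be used to describe the data 
in feature space, close in spirit to the algorithms described in [470], with hard 
boundaries, and [535], with “soft margins” (cf. also the algorithm described in 
Section 5.6). Again, we try to put most of the data into a small ball by solving 

$$\underset{R \in \mathbb{R},\ \boldsymbol{\xi} \in \mathbb{R}^m,\ c \in \mathcal{H}}{\text{minimize}} \quad R^2 + \frac{1}{\nu m}\sum_i \xi_i \quad \text{for } 0 < \nu \le 1$$

$$\text{subject to} \quad \|\Phi(x_i) - c\|^2 \le R^2 + \xi_i \text{ and } \xi_i \ge 0 \text{ for } i \in [m]. \tag{8.17}$$

This leads to the dual, 

$$\underset{\boldsymbol{\alpha}}{\text{minimize}} \quad \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i), \tag{8.18}$$

$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu m} \text{ and } \sum_i \alpha_i = 1, \tag{8.19}$$

and the solution 

$$c = \sum_i \alpha_i \Phi(x_i), \tag{8.20}$$

corresponding to a decision function of the form 

$$f(x) = \operatorname{sgn}\Big(R^2 - \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x, x)\Big). \tag{8.21}$$

As above, $R^2$ is computed such that for any $x_i$ with $0 < \alpha_i < 1/(\nu m)$ the argument 
of the sgn is zero. 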

For kernels $k(x, x')$ which only depend on $x - x'$ (the translation invariant 
kernels, such as RBF kernels), $k(x, x)$ is constant. In this case, the equality constraint 
implies that the linear term in the dual target function (8.18) is constant, and the 
problem (8.18)-(8.19) turns out to be equivalent to (8.13)-(8.15). It can be shown that 
the same holds true for the decision function, hence the two algorithms coincide in 
this case. This is geometrically plausible: for constant $k(x, x)$, all mapped patterns 
lie on a sphere in feature space. Therefore, finding the smallest ball containing the 
points amounts to finding the smallest segment of the sphere on which the data 
lie. This segment, however, can be found in a straightforward way by simply 
intersecting the data sphere with a hyperplane: the hyperplane with maximum 
margin of separation from the origin cuts off the smallest segment (Figure 8.3). 

Figure 8.3 For RBF kernels, which depend only on $x - x'$, $k(x, x)$ is constant, 
and the mapped data points thus lie on a hypersphere in feature space. In this 
case, finding the smallest sphere enclosing the data is equivalent to maximizing 
the margin of separation from the origin (cf. Figure 8.2). 
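
The equivalence of the two dual problems for kernels with constant $k(x, x)$ can be verified numerically. The sketch below assumes the Gaussian (8.5), for which $k(x, x) = 1$; the random points, the seed, and the function names are hypothetical. For any coefficient vector that sums to one, the ball objective (8.18) equals twice the hyperplane objective (8.13) minus 1, so both are minimized by the same coefficients.

```python
import math
import random


def gaussian_gram(points, c):
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / c) for q in points]
            for p in points]


def hyperplane_dual(alpha, K):
    """Objective (8.13)."""
    m = len(alpha)
    return 0.5 * sum(alpha[i] * alpha[j] * K[i][j] for i in range(m) for j in range(m))


def ball_dual(alpha, K):
    """Objective (8.18)."""
    m = len(alpha)
    quad = sum(alpha[i] * alpha[j] * K[i][j] for i in range(m) for j in range(m))
    return quad - sum(alpha[i] * K[i][i] for i in range(m))


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
K = gaussian_gram(points, c=0.5)

for _ in range(3):
    raw = [rng.random() for _ in points]
    alpha = [r / sum(raw) for r in raw]  # satisfies the equality constraint (8.15)
    # With k(x, x) = 1 the two objectives differ by a factor 2 and a constant.
    print(ball_dual(alpha, K), 2.0 * hyperplane_dual(alpha, K) - 1.0)
```

Since the feasible sets (8.14)-(8.15) and (8.19) are identical, an affine relation between the objectives with a positive factor is all that is needed for the minimizers to coincide.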


8.4 Optimization 


The previous section formulated quadratic programs (QPs) for computing regions 
that capture a certain fraction of the data. These constrained optimization problems 
can be solved via an off-the-shelf QP package (cf. Chapter 6). In the present 
section, however, we describe an algorithm which takes advantage of the precise 
form of the QPs [475], and which is an adaptation of the SMO (Sequential Minimal 
Optimization) algorithm [409]. Although most of the material on implementations is 
dealt with in Chapter 10, we briefly describe the single-class algorithm here. 
Further information on SMO in general can be found in Section 10.5; 
additional information on single-class SVM implementations, including 
variants which work in an online setting, can be found in Section 10.6.3. 

The SMO algorithm has been reported to work well in C-SV classification. The 
latter has a structure resembling the present optimization problem: just like the dual 
of C-SV classification (7.37), the present dual has only one equality constraint 
(8.15).² 

The strategy of SMO is to break up the constrained minimization of (8.13) into 
the smallest optimization steps possible. Note that it is not possible to modify 
variables $\alpha_i$ individually without violating the sum constraint (8.15). We therefore 
resort to optimizing over pairs of variables. 


2. The ν-SV classification algorithm (7.49), on the other hand, has two equality constraints. 
Therefore, it is not directly amenable to an SMO approach, unless we remove the equality 
constraint arising from the offset $b$, as done in [98]. 

Thus, consider optimizing over two variables $\alpha_i$ and $\alpha_j$ with all other variables 
fixed. Using the shorthand $K_{ij} := k(x_i, x_j)$, (8.13)-(8.15) then reduce to (up to a 
constant) 

$$\underset{\alpha_i, \alpha_j}{\text{minimize}} \quad \frac{1}{2}\left[\alpha_i^2 K_{ii} + \alpha_j^2 K_{jj} + 2\alpha_i \alpha_j K_{ij}\right] + c_i \alpha_i + c_j \alpha_j$$

$$\text{subject to} \quad \alpha_i + \alpha_j = \gamma, \quad 0 \le \alpha_i, \alpha_j \le \frac{1}{\nu m}, \tag{8.22}$$

in analogy to (10.63) below. Here the constants $c_i$, $c_j$, and $\gamma$ are defined as follows: 

$$c_i := \sum_{l \ne i,j}^{m} \alpha_l K_{il}, \quad c_j := \sum_{l \ne i,j}^{m} \alpha_l K_{jl}, \quad \text{and} \quad \gamma = 1 - \sum_{l \ne i,j}^{m} \alpha_l. \tag{8.23}$$

To find the minimum, we use $\alpha_i + \alpha_j = \gamma$. This allows us to obtain a constrained 
optimization problem in $\alpha_i$ alone by elimination of $\alpha_j$. For convenience we 
introduce $\chi := K_{ii} + K_{jj} - 2K_{ij}$. 

$$\underset{\alpha_i}{\text{minimize}} \quad \frac{1}{2}\alpha_i^2 \chi + \alpha_i \left(c_i - c_j + \gamma (K_{ij} - K_{jj})\right)$$

$$\text{subject to} \quad L \le \alpha_i \le H, \text{ where } L = \max(0, \gamma - 1/(\nu m)) \text{ and } H = \min(1/(\nu m), \gamma).$$


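Without going into details (a similar calculation can be found in Section 10.5.1) the 
minimizer $\alpha_i$ of this optimization problem is given by 

$$\alpha_i = \min(\max(L, \tilde{\alpha}_i), H), \tag{8.24}$$

where $\tilde{\alpha}_i$, the unconstrained solution, is given by 

$$\tilde{\alpha}_i = \alpha_i^{\text{old}} + \chi^{-1}\left(c_j - c_i + K_{jj}\alpha_j^{\text{old}} + K_{ij}\alpha_i^{\text{old}} - K_{ii}\alpha_i^{\text{old}} - K_{ij}\alpha_j^{\text{old}}\right) \tag{8.25}$$

$$= \alpha_i^{\text{old}} + \chi^{-1}\left(f^{\text{old}}(x_j) - f^{\text{old}}(x_i)\right). \tag{8.26}$$

Here $f^{\text{old}}(x)$ stands for the real-valued argument of the sgn in (8.12), evaluated with 
the coefficients before the update; the offset $\rho$ cancels in the difference. 
Finally, $\alpha_j$ can be obtained via $\alpha_j = \gamma - \alpha_i$. Eq. (8.26) tells us that the change in 
$\alpha_i$ will depend on the difference between the values $f(x_i)$ and $f(x_j)$. The less close 
these values are, i.e., the larger the difference in the distances to the hyperplane, 
the larger the possible change in the set of variables. Note, however, that there 
is no guarantee that the actual change in $\alpha_i$ will indeed be large, since $\alpha_i$ has to 
satisfy the constraint $L \le \alpha_i \le H$. Finally, the size of $\chi$ plays an important role, 
too (for the case of $\chi = 0$ see Lemma 10.3). The larger it is, the smaller the likely 
change in $\alpha_i$. 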
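
The analytic step on a pair of variables can be sketched as follows. This is a minimal illustration of the clipped update described above, not the implementation referred to in the text; the function names, the small Gram matrix, and the starting point are hypothetical.

```python
def pair_update(alpha, K, i, j, nu):
    """One analytic step on the pair (i, j): unconstrained minimizer, then clipping.

    Returns a new coefficient list; all entries except i and j are unchanged.
    """
    m = len(alpha)
    upper = 1.0 / (nu * m)
    gamma = alpha[i] + alpha[j]                        # conserved by the step
    chi = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if chi <= 1e-12:                                   # degenerate pair, skip it
        return list(alpha)
    out_i = sum(alpha[l] * K[i][l] for l in range(m))  # kernel expansion at x_i
    out_j = sum(alpha[l] * K[j][l] for l in range(m))  # kernel expansion at x_j
    unconstrained = alpha[i] + (out_j - out_i) / chi   # the offset rho cancels
    low = max(0.0, gamma - upper)
    high = min(upper, gamma)
    new = list(alpha)
    new[i] = min(max(low, unconstrained), high)        # clip to [L, H]
    new[j] = gamma - new[i]
    return new


def dual_objective(alpha, K):
    m = len(alpha)
    return 0.5 * sum(alpha[a] * alpha[b] * K[a][b] for a in range(m) for b in range(m))


K = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
alpha = [0.5, 0.5, 0.0]  # feasible start for nu = 2/3 (upper bound 0.5)
new_alpha = pair_update(alpha, K, 0, 2, nu=2.0 / 3.0)
print(new_alpha, dual_objective(alpha, K), dual_objective(new_alpha, K))
```

The step moves mass from the first coefficient, whose pattern has the larger output, to the third, and the dual objective decreases while the sum of the coefficients is preserved.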
We next briefly describe how to do the overall optimization. 


Initialization of the Algorithm We start by setting a random fraction $\nu$ of all 
$\alpha_i$ to $1/(\nu m)$. If $\nu m$ is not an integer, then one of the examples is set to a 
value in $(0, 1/(\nu m))$ to ensure that $\sum_i \alpha_i = 1$. Furthermore, we set the initial $\rho$ 
to $\max\{O_i \mid i \in [m], \alpha_i > 0\}$, where $O_i := \sum_j \alpha_j K_{ij}$ denotes the output of the 
kernel expansion on $x_i$. 

Optimization Algorithm We then select the first variable for the elementary 
optimization step in one of the two following ways. Here, we use the shorthand $SV_{nb}$ for 
the indices of variables which are not at bound (see also Section 10.5.5 for a more 
detailed description of such a strategy), 

$$SV_{nb} := \{i \mid i \in [m],\ 0 < \alpha_i < 1/(\nu m)\}. \tag{8.27}$$

These correspond to points which will sit exactly on the hyperplane once 
optimization is complete, and which will therefore have a strong influence on its 
precise position. 

(i) We scan over the entire data set³ until we find a variable that violates a 
KKT condition (Section 6.3.1); in other words, a point such that $(O_i - \rho) \cdot \alpha_i > 0$ 
or $(\rho - O_i) \cdot (1/(\nu m) - \alpha_i) > 0$. Calling this variable $\alpha_i$, we pick $\alpha_j$ according 
to 

$$j = \underset{n \in SV_{nb}}{\operatorname{argmax}}\ |O_i - O_n|. \tag{8.28}$$

(ii) Same as (i), but the scan is only performed over $SV_{nb}$. 

In practice, one scan of type (i) is followed by multiple scans of type (ii), until 
there are no KKT violators in $SV_{nb}$, whereupon the optimization goes back to a 
single scan of type (i). If the type (i) scan finds no KKT violators, the optimization 
algorithm terminates. 
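
The overall procedure can be sketched as a compact pairwise solver. The code below is a simplified stand-in, not the algorithm of the text: instead of the scans (i) and (ii) it uses the maximal-violating-pair rule (take mass from the pattern with the largest output among those with a positive coefficient, and give it to the pattern with the smallest output among those below the upper bound), and it initializes with the first examples rather than a random subset. All names, tolerances, and the toy data are assumptions.

```python
import math
import random


def gaussian_gram(points, c):
    """Gram matrix of the Gaussian kernel (8.5)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / c) for q in points]
            for p in points]


def train_one_class(K, nu, tol=1e-6, max_iter=100000):
    """Pairwise solver for the dual (8.13)-(8.15); returns (alpha, rho, outputs)."""
    m = len(K)
    upper = 1.0 / (nu * m)
    eps = 1e-12
    # Initialization: put coefficients at the upper bound until the unit mass
    # is used up (the first examples are taken here instead of a random subset).
    alpha, remaining = [0.0] * m, 1.0
    for n in range(m):
        alpha[n] = min(upper, remaining)
        remaining -= alpha[n]
    out = [sum(alpha[l] * K[n][l] for l in range(m)) for n in range(m)]
    for _ in range(max_iter):
        # Maximal violating pair: i can give up mass, j can still receive mass.
        i = max((n for n in range(m) if alpha[n] > eps),
                key=lambda n: out[n], default=None)
        j = min((n for n in range(m) if alpha[n] < upper - eps),
                key=lambda n: out[n], default=None)
        if i is None or j is None or out[i] - out[j] <= tol:
            break
        gamma = alpha[i] + alpha[j]
        chi = K[i][i] + K[j][j] - 2.0 * K[i][j]
        low, high = max(0.0, gamma - upper), min(upper, gamma)
        if chi > eps:
            new_i = min(max(low, alpha[i] + (out[j] - out[i]) / chi), high)
        else:
            new_i = low  # degenerate pair: the reduced objective is linear
        new_j = gamma - new_i
        d_i, d_j = new_i - alpha[i], new_j - alpha[j]
        for l in range(m):  # keep the outputs O_l up to date incrementally
            out[l] += d_i * K[l][i] + d_j * K[l][j]
        alpha[i], alpha[j] = new_i, new_j
    # Offset: midpoint between the two sides of the optimality condition.
    hi = max(out[n] for n in range(m) if alpha[n] > eps)
    lo = min((out[n] for n in range(m) if alpha[n] < upper - eps), default=hi)
    return alpha, 0.5 * (hi + lo), out


rng = random.Random(0)
points = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(20)]
nu = 0.2
alpha, rho, out = train_one_class(gaussian_gram(points, c=0.5), nu)
n_sv = sum(1 for a in alpha if a > 1e-9)
n_out = sum(1 for o in out if o < rho - 1e-5)
print(n_sv, n_out, rho)
```

The maximal-violating-pair rule was chosen here because its stopping test is a direct restatement of the KKT conditions; the scan heuristics of the text trade this simplicity for fewer kernel evaluations on large data sets.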


In unusual circumstances, the choice heuristic (8.28) cannot make positive progress. 
Therefore, a hierarchy of other choice heuristics is applied to ensure positive 
progress. These other heuristics are the same as in the case of pattern recogni- 
tion, cf. Chapter 10 and [409], and were found to work well in the experiments 
reported below. 

We conclude this section by stating a trick which is of importance in implementations. 
In practice, we must use a nonzero accuracy tolerance in tests for equality 
of numerical quantities. In particular, comparisons of this type are used in 
determining whether a point lies on the margin. Since we want the final decision 
function to return 1 for points which lie on the margin, we need to subtract this 
tolerance from $\rho$ at the end. 

In the next section, it will be argued that subtracting something from $\rho$ is also 
advisable from a statistical point of view. 


8.5 Theory 


We now analyze the algorithm theoretically, starting with the uniqueness of the 
hyperplane (Proposition 8.1). We describe the connection to pattern recognition 
(Proposition 8.2), and show that the parameter $\nu$ characterizes the fractions of 
SVs and outliers. The latter term refers to points which are on the wrong side of 
the hyperplane (Proposition 8.3). Following this, we give a robustness result for 
the soft margin (Proposition 8.4) and we present a theoretical result regarding the 
generalization error (Theorem 8.6). 

3. This scan can be accelerated by not checking patterns which are on the correct side of the 
hyperplane by a large margin, using the method of [266]. 

Figure 8.4 A separable data set, with the unique supporting hyperplane separating 
the data from the origin with maximum margin. 

As in some of the earlier chapters, we will use boldface letters to denote the 
feature space images of the corresponding patterns in input space, 

$$\mathbf{x}_i := \Phi(x_i). \tag{8.29}$$

We will call a data set 

$$X := \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \tag{8.30}$$

separable if there exists some $w \in \mathcal{H}$ such that $\langle w, \mathbf{x}_i \rangle > 0$ for $i \in [m]$ (see also 
Lemma 6.24). If we use a Gaussian kernel (8.5), then any data set $x_1, \ldots, x_m$ is 
separable after it is mapped into feature space, since in this case, all patterns lie 
inside the same orthant and have unit length (Section 2.3). 
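
The separability claim for the Gaussian kernel can be checked directly from the Gram matrix. In the sketch below, the weight vector is taken to be the mean of the mapped points, which is one convenient choice made here for illustration; the random data and the names are hypothetical.

```python
import math
import random


def gaussian_gram(points, c):
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / c) for q in points]
            for p in points]


rng = random.Random(1)
points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(8)]
K = gaussian_gram(points, c=0.5)
m = len(points)

# Unit length in feature space: <Phi(x), Phi(x)> = k(x, x) = 1.
unit_length = all(abs(K[i][i] - 1.0) < 1e-12 for i in range(m))
# All pairwise dot products in feature space are positive.
positive = all(K[i][j] > 0.0 for i in range(m) for j in range(m))
# w = mean of the mapped points separates them from the origin:
# <w, Phi(x_i)> = (1/m) * sum_j k(x_j, x_i) > 0 for every i.
margins = [sum(K[j][i] for j in range(m)) / m for i in range(m)]
print(unit_length, positive, min(margins))
```

Each entry of `margins` is at least $1/m$, because the term with $j = i$ alone contributes $k(x_i, x_i)/m = 1/m$ and all other terms are positive.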

The following proposition is illustrated in Figure 8.4. 


Proposition 8.1 (Supporting Hyperplane) If the data set $X$ is separable, then there 
exists a unique supporting hyperplane with the properties that (1) it separates all data 
from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. 
For any $\rho > 0$, the supporting hyperplane is given by 

$$\underset{w \in \mathcal{H}}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad \langle w, \mathbf{x}_i \rangle \ge \rho, \quad i \in [m]. \tag{8.31}$$

Proof Due to the separability, the convex hull of the data does not contain the 
origin. The existence and uniqueness of the hyperplane then follows from the 
supporting hyperplane theorem (e.g., [45]). 

In addition, separability implies that there actually exists some $\rho > 0$ and $w \in \mathcal{H}$ 
such that $\langle w, \mathbf{x}_i \rangle \ge \rho$ for $i \in [m]$ (by rescaling $w$, this can be seen to work for 
arbitrarily large $\rho$). The distance from the hyperplane $\{z \in \mathcal{H} : \langle w, z \rangle = \rho\}$ to the 
origin is $\rho/\|w\|$. Therefore the optimal hyperplane is obtained by minimizing $\|w\|$ 
subject to these constraints; that is, by the solution of (8.31). ∎ 


The following result elucidates the relationship between single-class classification 
and binary classification. 

Proposition 8.2 (Connection to Pattern Recognition) (i) Suppose $(w, \rho)$ parametrizes 
the supporting hyperplane for the data $X$. Then $(w, 0)$ parametrizes the optimal separating 
hyperplane for the labelled data set 

$$\{(\mathbf{x}_1, 1), \ldots, (\mathbf{x}_m, 1), (-\mathbf{x}_1, -1), \ldots, (-\mathbf{x}_m, -1)\}. \tag{8.32}$$

(ii) Suppose $(w, 0)$ parametrizes the optimal separating hyperplane passing through the 
origin for a labelled data set 

$$\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}, \quad (y_i \in \{\pm 1\} \text{ for } i \in [m]), \tag{8.33}$$

aligned such that $\langle w, \mathbf{x}_i \rangle$ is positive for $y_i = 1$. Suppose, moreover, that $\rho/\|w\|$ is the 
margin of the optimal hyperplane. Then $(w, \rho)$ parametrizes the supporting hyperplane for 
the unlabelled data set $X' = \{y_1 \mathbf{x}_1, \ldots, y_m \mathbf{x}_m\}$. 

Proof Ad (i). By construction, the separation of (8.32) is a point-symmetric problem. 
Hence, the optimal separating hyperplane passes through the origin, as if it did 
not, we could obtain another optimal separating hyperplane by reflecting the first 
one with respect to the origin; this would contradict the uniqueness of the 
optimal separating hyperplane. 

Next, observe that $(-w, \rho)$ parametrizes the supporting hyperplane for the data 
set reflected through the origin, and that it is parallel to that given by $(w, \rho)$. 
This provides an optimal separation of the two sets, with distance $2\rho/\|w\|$, and 
a separating hyperplane $(w, 0)$. 

Ad (ii). By assumption, $w$ is the shortest vector satisfying $y_i \langle w, \mathbf{x}_i \rangle \ge \rho$ (note 
that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying 
$\langle w, y_i \mathbf{x}_i \rangle \ge \rho$ for $i \in [m]$. ∎ 


Note that the relationship is similar for nonseparable problems. In this case, margin 
errors in binary classification (points which are either on the wrong side of the 
separating hyperplane or which fall inside the margin) translate into outliers in 
single-class classification, which are points that fall on the wrong side of the 
hyperplane. Proposition 8.2 then holds, cum grano salis, for the training sets with 
margin errors and outliers, respectively, removed. 

The utility of Proposition 8.2 lies in the fact that it allows us to recycle certain 
results from binary classification (Chapter 7) for use in the single-class scenario. 
The following property, which explains the significance of the parameter $\nu$, is such 
a case. 


Proposition 8.3 (ν-Property) Assume the solution of (8.6),(8.7) satisfies $\rho \ne 0$. The 
following statements hold: 

(i) $\nu$ is an upper bound on the fraction of outliers. 

(ii) $\nu$ is a lower bound on the fraction of SVs. 

(iii) Suppose the data $X$ were generated independently from a distribution $P(x)$ which does 
not contain discrete components. Suppose, moreover, that the kernel is analytic and 
non-constant. With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the 
fraction of outliers. 

The proof can be found in [475]. The result also applies to the soft margin ball 
algorithm of [535], provided that it is stated in the ν-parametrization given in 
(8.17). Figure 8.5 displays a 2-D toy example, illustrating how the choice of $\nu$ and 
the kernel width influence the solution. 

Figure 8.5 First two pictures: A single-class SVM applied to two toy problems; $\nu = c = 0.5$, 
domain: $[-1, 1]^2$. Note how in both cases, at least a fraction $1 - \nu$ of all examples is in the 
estimated region (cf. table). The large value of $\nu$ causes the additional data points in the 
upper left corner to have almost no influence on the decision function. For smaller values 
of $\nu$, such as 0.1 (third picture), these points can no longer be ignored. Alternatively, we 
can force the algorithm to take these ‘outliers’ (OLs) into account by changing the kernel 
width (8.5): in the fourth picture, using $c = 0.1$, $\nu = 0.5$, the data are effectively analyzed on 
a different length scale, which leads the algorithm to consider the outliers as meaningful 
points. 

[Table accompanying Figure 8.5; only part of the row “frac. SVs/OLs” survives: 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38. The row “margin $\rho/\|w\|$” is illegible.] 
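
Statements (i) and (ii) of the ν-property are easy to check once a solution is available. The sketch below counts SVs and outliers from the coefficients, the outputs of the kernel expansion, and the offset. The numbers are hand-made for illustration (they are consistent with the KKT conditions for $m = 5$ and $\nu = 0.4$ but do not come from an actual training run), and the function name is hypothetical.

```python
def sv_and_outlier_fractions(alpha, outputs, rho, tol=1e-9):
    """Fractions of SVs (alpha_i > 0) and of outliers (output strictly below rho)."""
    m = len(alpha)
    n_sv = sum(1 for a in alpha if a > tol)
    n_out = sum(1 for o in outputs if o < rho - tol)
    return n_sv / m, n_out / m


# Hand-made values for m = 5, nu = 0.4 (upper bound 1/(nu*m) = 0.5); illustrative only.
nu = 0.4
alpha = [0.5, 0.3, 0.2, 0.0, 0.0]            # one coefficient at bound, two non-bound SVs
outputs = [0.55, 0.70, 0.70, 0.90, 1.10]     # kernel expansion on the training patterns
rho = 0.70
frac_sv, frac_out = sv_and_outlier_fractions(alpha, outputs, rho)
print(frac_sv, frac_out)            # 0.6 0.2
print(frac_out <= nu <= frac_sv)    # True
```

The bounds follow from the constraints alone: an outlier has its coefficient at the upper bound $1/(\nu m)$, so at most $\nu m$ outliers fit under the sum constraint, while at least $\nu m$ nonzero coefficients are needed to reach a sum of 1.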


Proposition 8.4 (Resistance) Local movements of outliers parallel to $w$ do not change 
the hyperplane. 

Proof (Proposition 8.4) Suppose $\mathbf{x}_o$ is an outlier, for which $\xi_o > 0$; hence by the 
KKT conditions (Chapter 6) $\alpha_o = 1/(\nu m)$. Transforming it into $\mathbf{x}'_o := \mathbf{x}_o + \delta \cdot w$, 
where $|\delta| < \xi_o/\|w\|$, leads to a slack which is still nonzero, $\xi'_o > 0$, hence we still 
have $\alpha_o = 1/(\nu m)$. Therefore, $\boldsymbol{\alpha}' = \boldsymbol{\alpha}$ is still feasible, as is the primal solution 
$(w', \boldsymbol{\xi}', \rho')$. Here, we use $\xi'_i = (1 + \delta \cdot \alpha_o)\xi_i$ for $i \ne o$, $w' = (1 + \delta \cdot \alpha_o)w$, and $\rho'$ as 
computed from (8.16). Finally, the KKT conditions are still satisfied, as $\alpha'_o = 1/(\nu m)$ 
still holds. Thus (Section 6.3), $\boldsymbol{\alpha}$ remains the solution. ∎ 

Note that although the hyperplane does not change, its parametrization in $w$ and 
$\rho$ is different. In single-class SVMs, the hyperplane is not constrained to be in 
canonical form as it was in SV classifiers (Definition 7.1). 

We now move on to the subject of generalization. The goal is to bound the 
probability that a novel point drawn from the same underlying distribution lies 
outside of the estimated region. As in the case of pattern recognition, it turns out 
that the margin plays a central role. In the single-class case there is no margin 
between the two classes, for the simple reason that there is just one class. We 
can nevertheless introduce a “safety margin” and make a conservative statement 
about a slightly larger region than the one estimated. In a sense, this is not so 
different from pattern recognition: in Theorem 7.3, we try to separate the training 
data into two half-spaces separated by a margin, and then make a statement about 
the actual test error (rather than the margin test error); that is, about the probability 
that a new point will be misclassified, no matter whether it falls inside the margin 
or not. Just as in the single-class case, the statement regarding the test error thus 
refers to a region slightly larger than the one in which we try to put the training 
data. 


Definition 8.5 Let $f$ be a real-valued function on a space $\mathcal{X}$. Fix $\theta \in \mathbb{R}$. For $x \in \mathcal{X}$ let 
$d(x, f, \theta) = \max\{0, \theta - f(x)\}$. Similarly, for a training sequence $X := (x_1, \ldots, x_m)$, we 
define 

$$D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta). \tag{8.34}$$
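
The two quantities of Definition 8.5 translate directly into code. In the sketch below, the function `f` and the sample `X` are hypothetical and chosen only so that the sum is easy to verify by hand.

```python
def d(x, f, theta):
    """d(x, f, theta) = max{0, theta - f(x)} from Definition 8.5."""
    return max(0.0, theta - f(x))


def D(X, f, theta):
    """D(X, f, theta): sum of d over the training sequence, cf. (8.34)."""
    return sum(d(x, f, theta) for x in X)


# Hypothetical real-valued function and sample, for illustration only.
f = lambda x: 2.0 * x
X = [0.0, 0.25, 0.5, 1.0]
print(D(X, f, theta=1.0))  # 1.5
```

Only the points with $f(x) < \theta$ contribute: here the first two, with shortfalls 1 and 0.5.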

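Theorem 8.6 (Generalization Error Bound) Suppose we are given a set of $m$ examples 
$X \in \mathcal{X}^m$, generated from an unknown distribution $P$ that does not have discrete 
components. Suppose, moreover, that we solve the optimization problem (8.6),(8.7) (or 
equivalently (8.13)-(8.15)) and obtain a solution $f_w$ given explicitly by (8.12). Let 
$R_{w,\rho} := \{x \mid f_w(x) \ge \rho\}$ denote the induced decision region. With probability $1 - \delta$ over 
the draw of the random sample $X \in \mathcal{X}^m$, for any $\gamma > 0$, 

$$P\{x' \mid x' \notin R_{w,\rho-\gamma}\} \le \frac{2}{m}\left(k + \log_2 \frac{m^2}{2\delta}\right), \tag{8.35}$$

where 

$$k = \frac{c_1 \log_2(c_2 \hat{\gamma}^2 m)}{\hat{\gamma}^2} + \frac{2D}{\hat{\gamma}} \log_2\left(e\left(\frac{(2m-1)\hat{\gamma}}{2D} + 1\right)\right) + 2, \tag{8.36}$$

$c_1 = 16c^2$, $c_2 = \ln(2)/(4c^2)$, $c = 103$, $\hat{\gamma} = \gamma/\|w\|$, $D = D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0)$, 
and $\rho$ is given by (8.16). 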


The proof can be found in [475]. 

The training sample $X$ defines (via the algorithm) the decision region $R_{w,\rho}$. We 
expect that new points generated according to $P$ will lie in $R_{w,\rho}$. The theorem gives 
a probabilistic guarantee that new points lie in the larger region $R_{w,\rho-\gamma}$. 

The parameter $\nu$ can be adjusted when running the algorithm to trade off 
incorporating outliers against minimizing the “size” of $R_{w,\rho}$. Adjusting $\nu$ changes 
the value of $D$. Note that since $D$ is measured with respect to $\rho$ while the bound 
applies to $\rho - \gamma$, any point which is outside of the region to which the bound 
applies will make a contribution to $D$ that is bounded away from 0. Therefore, 
(8.35) does not imply that asymptotically, we will always estimate the complete 
support. 

The parameter $\gamma$ allows us to trade off the confidence with which we wish the 
assertion of the theorem to hold against the size of the predictive region $R_{w,\rho-\gamma}$: 
we can see from (8.36) that $k$, and hence the rhs of (8.35), scales inversely with $\gamma$. 
In fact, it scales inversely with $\hat{\gamma}$; in other words, it increases with $\|w\|$. This justifies 
measuring the complexity of the estimated region by the size of $w$, and minimizing 
$\|w\|^2$ in order to find a region that generalizes well. In addition, the theorem 
suggests not to use the offset $\rho$ returned by the algorithm, which corresponds to 
$\gamma = 0$, but a smaller value $\rho - \gamma$ (with $\gamma > 0$). 

In the present form, Theorem 8.6 is not a practical means to determine the
parameters $\nu$ and $\gamma$ explicitly. It is loose in several ways. The constant $c$ used
is far from its smallest possible value. Furthermore, no account is taken of the
smoothness of the kernel. If that were done (by using refined bounds on the
covering numbers of the induced class of functions, as in Chapter 12), then the
first term in (8.36) would increase much more slowly when decreasing $\gamma$. The
fact that the second term would not change indicates a different trade-off point.
Nevertheless, the theorem provides some confidence that $\nu$ and $\gamma$ are suitable
parameters to adjust.
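
To get a feeling for how loose the bound is, one can simply evaluate its right-hand side numerically. The sketch below assumes that (8.36) has the form $k = c_1 \log_2(c_2 \hat{\gamma}^2 m)/\hat{\gamma}^2 + (2D/\hat{\gamma}) \log_2(e((2m-1)\hat{\gamma}/(2D) + 1)) + 2$ with $c = 103$; the function names and the example values of $m$, $\hat{\gamma}$, $D$ and $\delta$ are illustrative choices, not taken from the experiments.

```python
import math

C = 103.0                           # constant c of the theorem
C1 = 16.0 * C ** 2                  # c_1 = 16 c^2
C2 = math.log(2.0) / (4.0 * C ** 2) # c_2 = ln(2) / (4 c^2)


def k_term(m, gamma_hat, D):
    """k from (8.36), in the form assumed in the lead-in above."""
    first = C1 * math.log2(C2 * gamma_hat ** 2 * m) / gamma_hat ** 2
    if D > 0:
        second = (2.0 * D / gamma_hat) * math.log2(
            math.e * ((2 * m - 1) * gamma_hat / (2.0 * D) + 1.0))
    else:
        second = 0.0  # the second term tends to 0 as D -> 0
    return first + second + 2.0


def bound(m, gamma_hat, D, delta):
    """Right-hand side of (8.35)."""
    return (2.0 / m) * (k_term(m, gamma_hat, D)
                        + math.log2(m ** 2 / (2.0 * delta)))


# Hypothetical settings: the bound only drops below 1 for very large m,
# and it grows when gamma_hat shrinks.
for m in (10 ** 6, 10 ** 8):
    print(m, bound(m, gamma_hat=1.0, D=10.0, delta=0.05))
print(bound(10 ** 8, gamma_hat=0.5, D=10.0, delta=0.05))
```

With these (made-up) values, a million examples still give a right-hand side above 1, which illustrates why the theorem is better read as qualitative guidance on $\nu$ and $\gamma$ than as a recipe for choosing them.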


8.6 Discussion 


Vapnik’s 
Principle 


Existence of a 
Density 


Before we move on to experiments, it is worthwhile to discuss some aspects of the 
algorithm described. As mentioned in the introduction, we could view it as being 
in line with Vapnik’s principle never to solve a problem which is more general than 
the one that we are actually interested in [561]. For instance, in situations where 
one is only interested in detecting novelty, it is not always necessary to estimate 
a full density model of the data. Indeed, density estimation is more difficult than 
what we are doing, in several respects. 

Mathematically speaking, a density only exists if the underlying probability 
measure possesses an absolutely continuous distribution function. The general 
problem of estimating the measure for a large class of sets, say the sets measurable 
in Borel’s sense, is not solvable, however (for a discussion, see [562]). Therefore 
we need to restrict ourselves to making a statement about the measure of some 
sets. Given a small class of sets, the simplest estimator which accomplishes this 
task is the empirical measure, which simply looks at how many training points 
fall into the region of interest. The present algorithm does the opposite. It starts 
with the number of training points that are supposed to fall into the region, and 
then estimates a region with the desired property. Often, there will be many such 
regions — the solution becomes unique only by applying a regularizer, which in 
the SV case enforces that the region be small in a feature space associated with the 
kernel. 
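
The contrast between the two directions can be made concrete with a small sketch. It is not the SV algorithm: as a stand-in regularizer it simply takes the smallest ball around the sample mean, and all names (`empirical_measure`, `ball_for_fraction`) and the toy data are hypothetical.

```python
import math
import random


def empirical_measure(points, in_region):
    """Empirical measure: fraction of training points inside the region."""
    return sum(1 for p in points if in_region(p)) / len(points)


def ball_for_fraction(points, nu):
    """Opposite direction: fix the fraction nu of points allowed outside,
    then return the smallest ball around the mean with that property."""
    m = len(points)
    dim = len(points[0])
    center = tuple(sum(p[d] for p in points) / m for d in range(dim))
    dists = sorted(math.dist(p, center) for p in points)
    n_out = int(nu * m + 1e-9)      # number of points allowed outside
    n_in = max(1, m - n_out)        # number of points that must be inside
    return center, dists[n_in - 1]


random.seed(0)
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
center, radius = ball_for_fraction(sample, nu=0.1)
inside = empirical_measure(sample, lambda p: math.dist(p, center) <= radius)
print(radius, inside)
```

Many other regions would contain the same fraction of points; the choice of the smallest ball around the mean plays the role that the feature space regularizer plays in the SV case.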

Therefore, we must keep in mind that the measure of smallness in this sense
depends on the kernel used, in a way that is no different from any other method
that regularizes in a feature space. A similar problem, however, already appears
in density estimation when done in input space. Let $p$ denote a density on $\mathcal{X}$.
If we perform a (nonlinear) coordinate transformation in the input domain $\mathcal{X}$,
then the density values change; loosely speaking, what remains constant is $p(x)\,dx$,
where $dx$ is also transformed. When directly estimating the probability measure of
regions, we are not faced with this problem, as the regions automatically change
accordingly.
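
A one-dimensional example makes this tangible. The sketch below assumes a uniform density on $[0, 1]$ and the transformation $y = x^2$ (both chosen purely for illustration): the density value at a point changes, but the probability mass of a region and of its image coincide.

```python
import math


def p_x(x):
    """Density of X, uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def p_y(y):
    """Density of Y = X^2, from the change of variables formula."""
    return 1.0 / (2.0 * math.sqrt(y)) if 0.0 < y <= 1.0 else 0.0


def integrate(f, lo, hi, n=100000):
    """Midpoint rule; accurate enough for this illustration."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h


a, b = 0.25, 0.5
mass_x = integrate(p_x, a, b)          # measure of [a, b] under X
mass_y = integrate(p_y, a * a, b * b)  # measure of the image [a^2, b^2] under Y
print(p_x(a), p_y(a * a))              # density values differ
print(mass_x, mass_y)                  # region masses agree
```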

An attractive property of the measure of smallness that the present algorithm
uses is that it can also be placed in the context of regularization theory, leading to
an interpretation of the solution as maximally smooth in a sense which depends
on the specific kernel used. More specifically, if $k$ is a Green’s function of $\Upsilon^*\Upsilon$ for
an operator $\Upsilon$ mapping into some dot product space (cf. Section 4.3), then the dual
objective function that we minimize equals

$$\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \|\Upsilon f\|^2, \tag{8.37}$$

using $f(x) = \sum_i \alpha_i k(x_i, x)$. In addition, we show in Chapter 4 that the regularization
operators of common kernels correspond to derivative operators.
Therefore, minimizing the dual objective function has the effect of maximizing
the smoothness of the function $f$ (which is, up to a thresholding operation, the
function we estimate). This, in turn, is related to a prior with density $p(f) \propto e^{-\|\Upsilon f\|^2}$
on the function space (cf. Chapter 16).
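
The regularization operator itself is not easy to write down in code, but the feature space reading of the same quadratic form is: $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ is the squared norm of $w = \sum_i \alpha_i \Phi(x_i)$. The following sketch checks this for a kernel with an explicit feature map; the homogeneous quadratic kernel, the points and the coefficients are illustrative assumptions.

```python
import math


def phi(x):
    """Explicit feature map of the kernel k(x, z) = <x, z>^2 in two dimensions."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def k(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2


X = [(1.0, 2.0), (0.5, -1.0), (-1.5, 0.3)]   # toy points
alpha = [0.2, 0.5, 0.3]                       # toy coefficients, summing to 1

# Quadratic form of the dual objective, written with kernel evaluations only.
quad = sum(alpha[i] * alpha[j] * k(X[i], X[j])
           for i in range(len(X)) for j in range(len(X)))

# The same quantity as ||w||^2 with w = sum_i alpha_i phi(x_i).
w = [sum(a * phi(x)[d] for a, x in zip(alpha, X)) for d in range(3)]
norm_w_sq = sum(c * c for c in w)
print(quad, norm_w_sq)
```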

Interestingly, as the minimization of the dual objective function also corresponds
to a maximization of the margin in feature space, an equivalent interpretation
is in terms of a prior on the distribution of the unknown other class (the
“novel” class in a novelty detection problem): trying to separate the data from
the origin amounts to assuming that the novel examples lie around the origin.

The main inspiration for the approach described stems from the earliest work of
Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing
a set of unlabelled data points by separating it from the origin using a hyperplane
[573, 570]. However, they quickly moved on to two-class classification problems,
both in terms of algorithms and in terms of the theoretical development of statistical
learning theory which originated in those days.

From an algorithmic point of view, we can identify two shortcomings of the
original approach, which may have caused research in this direction to stop for
more than three decades. First, the original algorithm [570] was limited to linear
decision rules in input space; second, there was no way of dealing with outliers. In
conjunction, these restrictions are indeed severe: a generic dataset need not be
separable from the origin by a hyperplane in input space. The two modifications
that the single-class SVM incorporates dispose of these shortcomings. First, the
kernel trick allows for a much larger class of functions by nonlinearly mapping
into a high-dimensional feature space, and thereby increases the chances of a
separation from the origin being possible. In particular, using a Gaussian kernel
(8.5), such a separation is always possible, as shown in Section 8.5. The second
modification directly allows for the possibility of outliers. This ‘softness’ of the
decision rule is incorporated using the $\nu$-trick, which we have already seen in
the classification case (Section 7.5), leading to a direct handle on the fraction of
outliers.
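
The separability claim for the Gaussian kernel can be checked numerically. Since all kernel values are positive, even the simple choice $w = \frac{1}{m} \sum_i \Phi(x_i)$ has a positive dot product with every mapped training point. The sketch assumes the kernel has the form $k(x, x') = \exp(-\|x - x'\|^2 / c)$; the data and the value of $c$ are arbitrary.

```python
import math
import random


def gaussian_kernel(x, z, c):
    """Gaussian kernel, assumed here to be exp(-||x - z||^2 / c)."""
    return math.exp(-math.dist(x, z) ** 2 / c)


random.seed(1)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
c = 0.5

# <w, Phi(x_j)> for w = (1/m) sum_i Phi(x_i), computed via the kernel.
outputs = [sum(gaussian_kernel(xi, xj, c) for xi in X) / len(X) for xj in X]

# Every output is at least k(x_j, x_j) / m = 1 / m > 0, so the hyperplane
# through the origin with normal w has all mapped points on one side.
margin = min(outputs)
print(margin)
```

The mean is of course not the maximum margin solution; it only certifies that a separating hyperplane exists.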

Given $\nu \in (0, 1]$, the resulting algorithm computes (8.6) subject to (8.7), and
thereby constructs a region $R$ such that for $OL = \{i : x_i \notin R\}$, we have $|OL|/m \le \nu$.
The “$\le$” is sharp in the sense that if we multiply the solution $w$ by $(1 - \epsilon)$ (with
$\epsilon > 0$), it becomes a “$>$.” The algorithm does not solve the following combinatorial
problem, however: given $\nu \in (0, 1]$, compute

$$\underset{w \in \mathcal{H},\, OL \subset [m]}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to } \langle w, \Phi(x_i) \rangle \ge 1 \text{ for } i \in [m] \setminus OL \text{ and } \frac{|OL|}{m} \le \nu. \tag{8.38}$$


Ben-David et al. [31] analyze a problem related to (8.38): they consider a sphere
(which for some feature spaces is equivalent to a half-space, as shown in Section 8.3),
fix its radius, and attempt to find its center such that it encloses as many
points as possible. They prove that it is already NP-hard to approximate the maximal
number to within a factor smaller than 3/418.

We conclude this section by mentioning another kernel-based algorithm that
has recently been proposed for use on unlabelled data [541]. This algorithm
applies to vector quantization, a standard process which finds a codebook such
that the training set can be approximated by elements of the codebook with small
error. Vector quantization is briefly described in Example 17.2 below; for further
detail, see [195].

Given some metric $d$, the kernel-based approach of [541] uses a kernel that
indicates whether two points lie within a distance $R > 0$ of each other,

$$k(x, x') = 1_{\{(x, x') \in \mathcal{X} \times \mathcal{X} \,:\, d(x, x') \le R\}}. \tag{8.39}$$

Let $\Phi_m$ be the empirical kernel map (2.56) with respect to the training set. The main
idea is that if we can find a vector $\alpha \in \mathbb{R}^m$ such that

$$\alpha^\top \Phi_m(x_i) > 0 \tag{8.40}$$

holds true for all $i = 1, \ldots, m$, then each point $x_i$ lies within a distance $R$ of some
point $x_j$ which has a positive weight $\alpha_j > 0$. To see this, note that otherwise all
nonzero components of $\alpha$ would get multiplied by components of $\Phi_m$ which are
0, and the dot product in (8.40) would equal 0.

To perform vector quantization, we can thus use optimization techniques, which
produce a vector $\alpha$ that satisfies (8.40) while being sparse. As in Section 7.7, this
can be done using linear programming techniques. Once optimization is complete,
the nonzero entries of $\alpha$ indicate the codebook vectors.
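
The covering idea behind (8.40) can be sketched in a few lines. Note that the code below does not use linear programming: as a simple stand-in it builds a sparse nonnegative $\alpha$ with a greedy covering heuristic, which is a different technique from the one in [541]. The function names and the toy data are hypothetical.

```python
import math
import random


def indicator_kernel(x, z, R):
    """Kernel (8.39): 1 if the two points lie within distance R, else 0."""
    return 1.0 if math.dist(x, z) <= R else 0.0


def greedy_codebook(X, R):
    """Greedy stand-in for the LP: repeatedly pick the point that covers
    the largest number of still uncovered training points."""
    m = len(X)
    K = [[indicator_kernel(X[i], X[j], R) for j in range(m)] for i in range(m)]
    alpha = [0.0] * m
    uncovered = set(range(m))
    while uncovered:
        best = max(range(m), key=lambda j: sum(K[j][i] for i in uncovered))
        alpha[best] = 1.0
        uncovered -= {i for i in uncovered if K[best][i] > 0}
    return alpha, K


random.seed(0)
X = ([(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)]
     + [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(20)])
alpha, K = greedy_codebook(X, R=1.0)

# Condition (8.40): alpha^T Phi_m(x_i) = sum_j alpha_j k(x_j, x_i) > 0 for all i.
covered = [sum(alpha[j] * K[j][i] for j in range(len(X))) > 0
           for i in range(len(X))]
codebook = [x for a, x in zip(alpha, X) if a > 0]
print(len(codebook), all(covered))
```

The loop terminates because every point covers at least itself. An LP formulation would instead minimize a sparsity-inducing norm of $\alpha$ subject to constraints derived from (8.40).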


8.7 Experiments


We apply the method to artificial and real-world data. Figure 8.6 shows a comparison
with a Parzen windows estimator on a 2-D problem, along with a family of
estimators which lie “in between” the present one and the Parzen one.

Figure 8.6 A single-class SVM applied to a toy problem; $c = 0.5$, domain: $[-1, 1]^2$, for
various settings of the offset $\rho$. As discussed in Section 8.3, $\nu = 1$ yields a Parzen windows
expansion. However, to get a Parzen windows estimator of the distribution’s support, we
must not use the offset returned by the algorithm (which would allow all points to lie
outside the estimated region). Therefore, in this experiment, we adjusted the offset such
that a fraction $\nu' = 0.1$ of patterns lie outside. From left to right, we show the results
for $\nu \in \{0.1, 0.2, 0.4, 1\}$. The rightmost picture corresponds to the Parzen estimator which
utilizes all kernels; the other estimators use roughly a fraction $\nu$ of kernels. Note that as a
result of the averaging over all kernels, the Parzen windows estimate does not model the
shape of the distribution very well for the chosen parameters.
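
The Parzen windows end of this family is easy to reproduce. The sketch below weights all kernels equally (as in a Parzen windows expansion) and then sets the offset by hand so that a chosen fraction of training points lies outside. The Gaussian kernel form $\exp(-\|x - x'\|^2 / c)$, the function names and the toy data are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, z, c):
    return math.exp(-math.dist(x, z) ** 2 / c)


def parzen_output(x, X, c):
    """Parzen windows expansion: every training point has weight 1/m."""
    return sum(gaussian_kernel(xi, x, c) for xi in X) / len(X)


def offset_for_fraction(outputs, frac):
    """Offset such that at most a fraction `frac` of outputs lies strictly below it."""
    s = sorted(outputs)
    n_out = min(int(frac * len(s) + 1e-9), len(s) - 1)
    return s[n_out]


random.seed(0)
X = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(100)]
c = 0.5
train_out = [parzen_output(x, X, c) for x in X]
rho = offset_for_fraction(train_out, 0.1)
frac_outside = sum(1 for o in train_out if o < rho) / len(X)
print(rho, frac_outside)
```

A new point $x$ would then be declared to lie inside the estimated region if `parzen_output(x, X, c) >= rho`.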

Figure 8.7 shows a plot of the outputs $\langle w, \Phi(x) \rangle$ on training and test sets of the
USPS database of handwritten digits (Section A.1). We used a Gaussian kernel
(8.5), which has the advantage that the data are always separable from the origin
in feature space (Section 2.3). For the kernel parameter $c$, we used $0.5 \cdot 256$. This
value was chosen a priori, and is a common value for SVM classifiers on that data
set, cf. Chapter 7.⁴ The algorithm was given only the training instances of digit 0.
Testing was done on both digit 0 and on all other digits. We present results for two
values of $\nu$, one large, one small; for values in between, the results are qualitatively
similar. In the first experiment, we used $\nu = 50\%$, thus aiming for a description of
“0-ness” which only captures half of all zeros in the training set. As shown in Figure
8.7, this leads to zero false positives (i.e., even though the learning machine has not
seen any non-0s during training, it correctly identifies all non-0s as such), while
still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be
achieved using smaller values of $\nu$. For $\nu = 5\%$, we get 91% correct recognition of
digits 0 in the test set, with a fairly moderate false positive rate of 7%.

4. In [236], the following procedure is used to determine a value of $c$. For small $c$, all training
points become SVs: the algorithm just memorizes the data, and will not generalize well.
As $c$ increases, the number of SVs drops. As a simple heuristic, we can thus start with a
small value of $c$ and increase it until the number of SVs does not decrease any further.

Figure 8.7 Experiments using the USPS OCR dataset. Recognizer for digit 0; output
histogram for the exemplars of 0 in the training/test set, and on test exemplars of
other digits. The x-axis gives the output values; in other words, the argument of the
sgn function in (8.12). For $\nu = 50\%$ (top), we get 50% SVs and 49% outliers (consistent
with Proposition 8.3), 44% true positive test examples, and zero false positives from
the “other” class. For $\nu = 5\%$ (bottom), we get 6% and 4% for SVs and outliers,
respectively. In that case, the true positive rate is improved to 91%, while the false
positive rate increases to 7%. The offset $\rho$ is marked in the graphs.
Note, finally, that the plots show a Parzen windows density estimate of the output
histograms. In reality, many examples sit exactly at the offset value (the non-bound
SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the
training set appear slightly larger than they should be.

Figure 8.8 20 examples randomly drawn from the USPS test set, with class labels.

Although leading to encouraging results, this experiment does not really address
the actual task the algorithm was designed for. Therefore, we next focus on
a problem of novelty detection. Again, we utilized the USPS set; this time, however,
we trained the algorithm on the test set and used it to identify outliers. It is well
known that the USPS test set (Figure 8.8) contains a number of patterns which
are hard or impossible to classify, due to segmentation errors or mislabelling. In
this experiment, we augmented the input patterns by ten extra dimensions corresponding
to the class labels of the digits. The rationale for this is that if we disregarded
the labels, there would be no hope of identifying mislabelled patterns as
outliers. With the labels, the algorithm has the chance to identify both unusual
patterns and usual patterns with unusual labels. Figure 8.9 shows the 20 worst
outliers for the USPS test set. Note that the algorithm indeed extracts
patterns which are very hard to assign to their respective classes. In the experiment,
we used the same kernel width as above, and a $\nu$ value of 5%. The latter
was chosen to roughly reflect the expectation as to how many “bad” patterns there
are in the test set: most good learning algorithms achieve error rates of 3-5% on
the USPS benchmark (for a list of results, cf. Table 7.4).
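
The label augmentation can be sketched as follows. The text only states that ten extra dimensions corresponding to the class labels were appended; the one-hot encoding used below, as well as the function name, are assumptions.

```python
def augment_with_label(pattern, label, n_classes=10):
    """Append a one-hot encoding of the class label to the input pattern.
    (One-hot coding is an assumed way of filling the ten extra dimensions.)"""
    one_hot = [0.0] * n_classes
    one_hot[label] = 1.0
    return list(pattern) + one_hot


# A USPS pattern has 16 x 16 = 256 pixels; a dummy all-zero image is used here.
image = [0.0] * 256
augmented = augment_with_label(image, label=3)
print(len(augmented), augmented[256:])
```

Two identical images with different labels then differ in the augmented space, which is what allows a pattern with an unusual label to end up far from the bulk of the data.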
Table 8.1 Experimental results for various values of the outlier control constant $\nu$, USPS
test set, size $m = 2007$. Note that $\nu$ bounds the fractions of outliers and Support Vectors
from above and below, respectively (cf. Proposition 8.3). As we are not in the asymptotic
regime, there is some slack in the bounds; nevertheless, $\nu$ can be used to control these
fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running
at 450 MHz) increase as $\nu$ approaches 1. This is related to the fact that almost all Lagrange
multipliers are at the upper bound in that case (cf. Section 8.4). The system used in the
outlier detection experiments is shown in boldface.

Figure 8.9 Outliers identified by the proposed algorithm, ranked by the negative output of
the SVM (the argument of (8.12)). The outputs (for convenience in units of $10^{-5}$) are written
underneath each image in italics, the (alleged) class labels are given in bold face. Note that
most of the examples are “difficult” in that they are either atypical or mislabelled.


In the last experiment, we tested the runtime scaling behavior of the SMO solver
used for training (Figure 8.10). Performance was found to depend on $\nu$. For the
small values of $\nu$ which are typically used in outlier detection tasks, the algorithm
scales very well to larger data sets, with a dependency of training times on the
sample size which is at most quadratic.

In addition to the experiments reported above, the present algorithm has since 
been applied in several other domains, such as the modelling of parameter regimes 
for the control of walking robots [528], condition monitoring of jet engines [236], 
and hierarchical clustering problems [35]. 
Figure 8.10 Training times $t$ (in seconds) vs. data set sizes $m$ (both axes depict logs at base
2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the
USPS test set); $c = 0.5 \cdot 256$. As in Table 8.1, it can be seen that larger values of $\nu$ generally
lead to longer training times (note that the plots use different y-axis ranges). Times also
differ in their scaling with the sample size, however. The exponents can be directly read
from the slope of the graphs, as they are plotted in log scale with equal axis spacing: for
small values of $\nu$ ($\le 5\%$), the training times are approximately linear in the training set
size (left plot). The scaling gets worse as $\nu$ increases. For large values of $\nu$, training times
are roughly proportional to the sample size raised to the power of 2.5 (right plot). The
results should be taken only as an indication of what is going on: they were obtained
using fairly small training sets; the largest being 2007, the size of the USPS test set. As a
consequence, they are fairly noisy, and strictly speaking, they refer only to the examined
regime. Encouragingly, the scaling is better than the cubic scaling that we would expect
when solving the optimization problem using all patterns at once, cf. Section 8.4. Moreover,
for small values of $\nu$, which are typically used in outlier detection (in Figure 8.9, we used
$\nu = 5\%$), the algorithm is particularly efficient.
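
Reading an exponent off a log-log plot amounts to fitting a straight line to $(\log_2 m, \log_2 t)$. A minimal sketch, using synthetic timings generated from an assumed power law rather than the measurements behind Figure 8.10:

```python
import math


def scaling_exponent(sizes, times):
    """Least squares slope of log2(time) against log2(size)."""
    xs = [math.log2(m) for m in sizes]
    ys = [math.log2(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


sizes = [250, 500, 1000, 2000]
times = [1e-6 * m ** 2.5 for m in sizes]   # synthetic data, not measured times
print(scaling_exponent(sizes, times))      # close to 2.5 by construction
```

With real, noisy timings on a handful of small training set sizes, the fitted slope is only a rough indication, in line with the caveats in the caption.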


8.8 Summary 


In the present chapter, we described SV algorithms that can be applied to unlabelled
data. Statistically speaking, these are a solution to the problem of multi-dimensional
quantile estimation. Practically speaking, they provide a “description”
of the dataset that can be thought of as a “single-class” classifier, which can tell
whether a new point is likely to have been generated by the same regularity as the
training data. This makes them applicable to problems such as novelty detection.
We began with a discussion of the quantile estimation problem, which led us
to a quantile estimation algorithm that is very similar to the SV classification
algorithm in that it uses the same type of large margin regularizer. To deal with
outliers in the data, we made use of the $\nu$-trick introduced in the previous chapter.
We described an SMO-style optimization problem to compute the unique solution
of the algorithm, and we provided a theoretical analysis of a number of aspects,
such as robustness with respect to outliers, and generalization to unseen data.

Finally, we described outlier detection experiments using real-world data. It is
worthwhile to note at this point that single-class approaches have abundant practical
applications. Suggestions include information retrieval, problems in medical
diagnosis [534], marketing [33], condition monitoring of machines [138], estimating
manufacturing yields [531], econometrics and generalized nonlinear principal
curves [550, 308], regression and spectral analysis [417], tests for multimodality
and clustering [416], and others [374]. Many of these papers, in particular those
with a theoretical slant (e.g., [417, 114]), do not go all the way in devising practical
algorithms that work on high-dimensional real-world problems, a notable exception
being the work of Tarassenko et al. [534]. The single-class SVM described in
this chapter constitutes a practical algorithm with well-behaved computational
complexity (convex quadratic programming) for these problems.


8.9 Problems 


8.1 (Uniqueness of the Supporting Hyperplane ••) Using Lemma 6.24 (cf. Figure 6.7),
prove that if the convex hull of the training data does not contain the origin,
then there exists a unique hyperplane with the properties that (1) it separates all data from
the origin, and (2) its distance to the origin is maximal among all such hyperplanes (cf.
Proposition 8.1).

Hint: argue that you can limit yourself to finding the maximum margin of separation
over all weight vectors of length $\|w\| \le 1$; that is, over a convex set.


8.2 (Soft Ball Algorithm [535] ••) Extend the hard ball algorithm described in Section 5.6
to the soft margin case; in other words, derive (8.18) and (8.19) from (8.17).


8.3 (Hard Margin Limit for Positive Offsets •) Show that if we require $\rho \ge 0$ in the
primal problem, we end up with the constraint $\sum_i \alpha_i \ge 1$ instead of $\sum_i \alpha_i = 1$ (see (8.15)).
Consider the hard margin limit $\nu \to 0$ and argue that in this case, the problem can become
infeasible, and the multipliers $\alpha_i$ can diverge during the optimization. Give a geometric
interpretation of why this happens for $\rho \ge 0$, but not if $\rho$ is free.


8.4 (Positivity of $\rho$ ••) Prove that the solution of (8.6) satisfies $\rho \ge 0$.


8.5 (Hard Margin Identities ••) Consider the hard margin optimization problem, consisting
of minimizing $\|w\|^2$ subject to $\langle w, x_i \rangle \ge \rho$ for $i \in [m]$ (here, $\rho > 0$ is a constant).
Prove that the following identities hold true for the solution $\alpha$:

$$\left(\frac{\rho}{d}\right)^2 = \|w\|^2 = 2W(\alpha). \tag{8.41}$$

Here, $W(\alpha)$ denotes the value of the dual objective function, and $d$ is the distance of the
hyperplane to the origin. Hint: use the KKT conditions and the primal constraints.


8.6 ($\nu$-Property for Single-Class SVMs [475] ••) Prove Proposition 8.3, using the
techniques from the proof of Proposition 7.5.


8.7 (Graphical Proof of $\nu$-Property •) Give a graphical proof of the first two statements
of Proposition 8.3 along the lines of Figure 9.5.


8.8 ($\nu$-Property for the Soft Ball Algorithm ••) Prove Proposition 8.3 for the soft ball
algorithm (8.18), (8.19).


8.9 (Multi-class SV Classification ••) Implement the single-class algorithm and use it
to solve a multi-class classification problem by training a single-class recognizer on each
class. Discuss how to best compare the outputs of the individual recognizers in order to get
to a multi-class decision.


8.10 (Negative Data [476] ••) Derive a variant of the single-class algorithm that can
handle negative data by reflecting the negative points at the origin in feature space, and
then solving the usual one-class problem. Show that this leads to the following problem:

$$\underset{\alpha \in \mathbb{R}^m}{\text{minimize}} \quad \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \tag{8.42}$$
$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu m}, \tag{8.43}$$
$$\sum_i \alpha_i = 1. \tag{8.44}$$

Argue, moreover, that the decision function takes the form

$$f(x) = \operatorname{sgn}\left(\sum_i \alpha_i y_i k(x_i, x) - \rho\right), \tag{8.45}$$

and that $\rho$ can be computed by exploiting that for any $x_i$ such that $\alpha_i \in (0, 1/(\nu m))$,

$$\rho = \langle w, \Phi(x_i) \rangle = \sum_j \alpha_j y_j k(x_j, x_i). \tag{8.46}$$

Show that the algorithm (8.13) is a special case of the above one. Discuss the connection to
the SVC algorithm (Chapter 7), in particular with regard to how the two algorithms deal
with unbalanced data sets.


8.11 (Separation from General Points [476] ••) Derive a variant of the single-class
algorithm that, rather than separating the points from the origin in feature space, separates
them from the mean of some other set of points $\Phi(z_1), \ldots, \Phi(z_t)$. Argue that this lets you
take into account the unknown “other” class in a “weak” sense. Hint: modify the first
constraint in (8.7) to

$$\left\langle w, \Phi(x_i) - \frac{1}{t} \sum_{n=1}^{t} \Phi(z_n) \right\rangle \ge \rho - \xi_i, \tag{8.47}$$

and the decision function to

$$f(x) = \operatorname{sgn}\left(\left\langle w, \Phi(x) - \frac{1}{t} \sum_{n=1}^{t} \Phi(z_n) \right\rangle - \rho\right). \tag{8.48}$$

Prove that the dual problem takes the following form:

$$\underset{\alpha \in \mathbb{R}^m}{\text{minimize}} \quad \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \left(k(x_i, x_j) + q - q_j - q_i\right), \tag{8.49}$$
$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu m}, \tag{8.50}$$
$$\sum_i \alpha_i = 1, \tag{8.51}$$

where $q = \frac{1}{t^2} \sum_{n,p} k(z_n, z_p)$ and $q_j = \frac{1}{t} \sum_n k(x_j, z_n)$.
Discuss the special case where you try to separate a data set from its own mean and
argue that this provides a method for computing one-sided quantiles with large margin.


8.12 (Cross-Validation •) Discuss the question of how to validate whether a single-class
SVM generalizes well. What are the main differences with the problem of pattern
recognition?


8.13 (Leave-One-Out Bound ••) Using Theorem 12.9, prove that the generalization
error of single-class SVMs trained on samples of size $m - 1$ is bounded by the number
of SVs divided by $m$.

Note that this bound makes a statement about the case where $\gamma = 0$ (cf. Theorem 8.6).
Argue that this can cause the bound to be loose. Compare the case of pattern recognition,
and argue that the usual leave-one-out bound is also loose there, since it makes a statement
about the probability of a test point being misclassified or lying inside the margin.


8.14 (Kernel-Dependent Generalization Error Bounds ◦◦◦) Modify Theorem 8.6 to
take into account properties of the kernel along the lines of the entropy number methods
described in Chapter 12.


8.15 (Model Selection ◦◦◦) Try to come up with principled model selection methods for
single-class SVMs. How would you recommend choosing $\nu$ and the kernel parameter?
How would you choose $\gamma$ (Theorem 8.6)?


9 Regression Estimation


Overview


Prerequisites


In this chapter, we explain the ideas underlying Support Vector (SV) machines for
function estimation. We start by giving a brief summary of the motivations and
formulations of an SV approach for regression estimation (Section 9.1), followed
by a derivation of the associated dual programming problems (Section 9.2). After
some illustrative examples, we cover extensions to linear programming settings
and a $\nu$-variant that utilizes a more convenient parametrization. In Section 9.6, we
discuss some applications, followed by a summary (Section 9.7) and a collection
of problems for the reader.

Although it is not strictly indispensable, we recommend that the reader first 
study the basics of the SVM classification algorithm, at least at the level of detail 
given in Chapter 1. The derivation of the dual (Section 9.2) is self-contained, but 
would benefit from some background in optimization (Chapter 6), especially in 
the case of the more advanced formulations given in Section 9.2.2. If desired, these 
can actually be skipped at first reading. Section 9.3 describes a modification of 
the standard SV regression algorithm, along with some considerations on issues 
such as robustness. The latter can be best understood within the context given in 
Section 3.4. Finally, Section 9.4 deals with linear programming regularizers, which 
were discussed in detail in Section 4.9.2. 


9.1 Linear Regression with Insensitive Loss Function 


$\varepsilon$-Insensitive
Loss


SVMs were first developed for pattern recognition. As described in Chapter 7, they
represent the decision boundary in terms of a typically small subset of all training
examples: the Support Vectors. When the SV algorithm was generalized to
the case of regression estimation (that is, to the estimation of real-valued functions,
rather than just $\{\pm 1\}$-valued ones, as is the case in pattern recognition), it was crucial
to find a way of retaining this feature. In order for the sparseness property to
carry over to the case of SV regression, Vapnik devised the so-called $\varepsilon$-insensitive
loss function (Figure 1.8) [561],

$$|y - f(x)|_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\}, \tag{9.1}$$

which does not penalize errors below some $\varepsilon > 0$, chosen a priori.¹ The rationale
behind this choice is the following. In pattern recognition, when measuring the
loss incurred for a particular pattern, there is a large area where we accrue zero
loss: whenever a pattern is on the correct side of the decision surface, and does
not touch the margin, it does not contribute any loss to the objective function
(7.35). Correspondingly, it does not carry any information about the position of
the decision surface; after all, the latter is computed by minimizing that very
objective function. This is the underlying reason why the pattern does not appear
in the SV expansion of the solution. A loss function for regression estimation must
also have an insensitive zone; hence we use the $\varepsilon$-insensitive loss.
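
In code, the loss (9.1) is a one-liner. The function name and the numerical values below are merely illustrative.

```python
def eps_insensitive_loss(y, fx, eps):
    """epsilon-insensitive loss |y - f(x)|_eps = max(0, |y - f(x)| - eps)."""
    return max(0.0, abs(y - fx) - eps)


# Deviations inside the insensitive zone cost nothing ...
print(eps_insensitive_loss(1.0, 1.05, eps=0.1))
# ... while larger deviations are penalized linearly beyond eps.
print(eps_insensitive_loss(1.0, 1.5, eps=0.1))
```

Points whose loss is zero play the same role as patterns beyond the margin in classification: moving them slightly does not change the objective.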

The regression algorithm is then developed in close analogy to the case of
pattern recognition. Again, we estimate linear functions, use a $\|w\|^2$ regularizer,
and rewrite everything in terms of dot products to generalize to the nonlinear case.
The basic SV regression algorithm, which we will henceforth call $\varepsilon$-SVR, seeks to
estimate linear functions,²

$$f(x) = \langle w, x \rangle + b, \quad \text{where } w, x \in \mathcal{H},\ b \in \mathbb{R}, \tag{9.2}$$

based on independent and identically distributed (iid) data,

$$(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{H} \times \mathbb{R}. \tag{9.3}$$

1. The insensitive zone is sometimes referred to as the $\varepsilon$-tube. Actually, this term is slightly
misleading, as in multi-dimensional problems, the insensitive zone has the shape of a slab
rather than a tube; in other words, the region between two parallel hyperplanes, differing
in their $y$ offset.

2. Strictly speaking, these should be called affine functions. We will not indulge in these fine
distinctions. The crucial bit is that the part to which we apply the kernel trick is linear.

Regularized Risk
Functional


Flatness
vs. Margin


Here, $\mathcal{H}$ is a dot product space in which the (mapped) input patterns live (i.e., the
feature space induced by a kernel). The goal of the learning process is to find a
function $f$ with a small risk (or test error) (cf. Chapter 3),

$$R[f] = \int c(f, x, y)\, dP(x, y), \tag{9.4}$$

where $P$ is the probability measure which is assumed to be responsible for the
generation of the observations (9.3), and $c$ is a loss function, such as $c(f, x, y) =
(f(x) - y)^2$, or one of many other possible choices (Chapter 3). The particular
loss function for which we would like to minimize (9.4) depends on the specific
regression estimation problem at hand. Note that this does not necessarily have to
coincide with the loss function used in our learning algorithm. First, there might
be additional constraints that we would like our regression estimation to satisfy,
for instance that it have a sparse representation in terms of the training data;
in the SV case, this is achieved through the insensitive zone in (9.1). Second, we
cannot minimize (9.4) directly in any case, since we do not know $P$. Instead, we
are given the sample (9.3), and we try to obtain a small risk by minimizing the
regularized risk functional,

$$\frac{1}{2}\|w\|^2 + C \cdot R^{\varepsilon}_{\mathrm{emp}}[f]. \tag{9.5}$$

Here,

$$R^{\varepsilon}_{\mathrm{emp}}[f] := \frac{1}{m} \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon \tag{9.6}$$

measures the $\varepsilon$-insensitive training error, and $C$ is a constant determining the
trade-off with the complexity penalizer $\|w\|^2$. In short, minimizing (9.5) captures
the main insight of statistical learning theory, stating that in order to obtain a small
risk, we need to control both training error and model complexity, by explaining
the data with a simple model (Chapter 5).
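
To see the trade-off at work, the regularized risk (9.5) can be evaluated for two candidate linear functions on a toy data set. Everything in the sketch (data, candidates, values of $C$ and $\varepsilon$, function names) is an illustrative assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def regularized_risk(w, b, X, y, C, eps):
    """Regularized risk (9.5): 0.5 * ||w||^2 + C * R_emp^eps[f], with R_emp^eps from (9.6)."""
    m = len(X)
    emp = sum(max(0.0, abs(yi - (dot(w, x) + b)) - eps)
              for x, yi in zip(X, y)) / m
    return 0.5 * dot(w, w) + C * emp


X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.0, 3.0]
eps = 0.5

flat = ((0.0,), 1.0)     # f(x) = 1: simple, but a large training error
steep = ((1.5,), 0.0)    # f(x) = 1.5 x: fits within eps, but a larger ||w||

for C in (0.5, 10.0):
    print(C,
          regularized_risk(*flat, X, y, C, eps),
          regularized_risk(*steep, X, y, C, eps))
```

For small $C$ the flat function has the smaller regularized risk, for large $C$ the better fitting one does.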

A small $\|w\|^2$ corresponds to a linear function (9.2) that is flat in feature
space. Note that in the case of pattern recognition, we use the same regularizer
(cf. (7.35)); however, it corresponded to a large margin in this case. How does this
difference arise? A related question, it turns out, is why SV regression requires an
extra parameter $\varepsilon$, while SV pattern recognition does not.

Let us try to understand how these seemingly different problems are actually
identical.


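Definition 9.1 ($\varepsilon$-margin) Let $(E, \|\cdot\|_E)$, $(G, \|\cdot\|_G)$ be normed spaces, and $\mathcal{X} \subset E$. We
define the $\varepsilon$-margin of a function $f : \mathcal{X} \to G$ as

$$m_\varepsilon(f) := \inf\{\|x - x'\|_E \mid x, x' \in \mathcal{X},\ \|f(x) - f(x')\|_G \ge 2\varepsilon\}. \tag{9.7}$$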


Let us look at a 1-D toy problem (Figure 9.1). In pattern recognition, we are looking
for a function which exceeds some constant $\varepsilon$ (using the canonical hyperplanes
of Definition 7.1, this constant is $\varepsilon = 1$) on the positive patterns, and which is
smaller than $-\varepsilon$ on the negative patterns. The points where the function takes
the values $\pm\varepsilon$ define the $\varepsilon$-margin in the space of the patterns (the x-axis in
the plot). Therefore, the flatter the function $f$, the larger the classification margin.
This illustrates why both SV regression and SV pattern recognition use the same
regularizer $\|w\|^2$, albeit with different effects. For more detail on these issues and
on the $\varepsilon$-margin, cf. Section 9.8; cf. also [561, 418].

Figure 9.1 1D toy problem: separate ‘x’ from ‘o’. The SV classification algorithm
constructs a linear function $f(x) = \langle w, x \rangle + b$ satisfying the canonicality condition
$\min\{|f(x)| \mid x \in \mathcal{X}\} = 1$ (or equivalently, $\varepsilon = 1$). To maximize the margin $m_\varepsilon(f)$, we have
to minimize $|w|$.
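
For a linear function $f(x) = wx + b$ on the real line, Definition 9.1 gives $m_\varepsilon(f) = 2\varepsilon/|w|$, so a flatter function has a larger $\varepsilon$-margin. The following brute-force sketch evaluates (9.7) on a finite grid (an assumption made so that the infimum becomes a minimum); the grid and the two slopes are arbitrary.

```python
def eps_margin(f, X, eps):
    """Brute-force evaluation of (9.7) over a finite set X of real numbers."""
    best = float("inf")
    for x in X:
        for z in X:
            if abs(f(x) - f(z)) >= 2.0 * eps:
                best = min(best, abs(x - z))
    return best


grid = [i / 20 for i in range(-100, 101)]   # points in [-5, 5], spacing 0.05
eps = 1.0

steep = eps_margin(lambda x: 2.0 * x, grid, eps)   # w = 2
flat = eps_margin(lambda x: 0.5 * x, grid, eps)    # w = 0.5
print(steep, 2.0 * eps / 2.0)
print(flat, 2.0 * eps / 0.5)
```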

The minimization of (9.5) is equivalent to the following constrained optimization problem:

$$\begin{aligned}
\text{minimize} \quad & \tau(w, \boldsymbol{\xi}^{(*)}) = \frac{1}{2}\|w\|^2 + C \cdot \frac{1}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*), && (9.8)\\
\text{subject to} \quad & (\langle w, x_i \rangle + b) - y_i \le \varepsilon + \xi_i, && (9.9)\\
& y_i - (\langle w, x_i \rangle + b) \le \varepsilon + \xi_i^*, && (9.10)\\
& \xi_i^{(*)} \ge 0. && (9.11)
\end{aligned}$$

Here and below, it is understood that $i = 1, \ldots, m$, and that bold face Greek letters denote $m$-dimensional vectors of the corresponding variables; $^{(*)}$ is a shorthand implying both the variables with and without asterisks.
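The primal problem (9.8)-(9.11) can be made concrete with a minimal sketch. For a given linear function, the smallest slacks satisfying (9.9)-(9.11) are the positive parts of the constraint violations, so the objective can be evaluated directly. The function name `eps_svr_primal` and the use of NumPy are assumptions made for illustration; this is not an optimizer, only an evaluation of $\tau$ at a candidate $(w, b)$.

```python
import numpy as np

def eps_svr_primal(w, b, X, y, eps, C):
    """Evaluate the objective (9.8) at (w, b) using the smallest feasible slacks."""
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    f = X @ w + b                              # f(x_i) = <w, x_i> + b
    xi = np.maximum(0.0, f - y - eps)          # tightest slack for (9.9)
    xi_star = np.maximum(0.0, y - f - eps)     # tightest slack for (9.10)
    objective = 0.5 * (w @ w) + C * np.mean(xi + xi_star)
    return objective, xi, xi_star
```

Only points with $|f(x_i) - y_i| > \varepsilon$ contribute to the second term, which is the $\varepsilon$-insensitive loss in disguise.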


9.2 Dual Problems 


9.2.1 $\varepsilon$-Insensitive Loss


The key idea is to construct a Lagrangian from the objective function and the corresponding constraints, by introducing a dual set of variables. It can be shown that this function has a saddle point with respect to the primal and dual variables at the solution; for details see Chapter 6. We define a Lagrangian,

$$\begin{aligned}
L := {} & \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) - \sum_{i=1}^{m} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\
& - \sum_{i=1}^{m} \alpha_i (\varepsilon + \xi_i + y_i - \langle w, x_i \rangle - b) \\
& - \sum_{i=1}^{m} \alpha_i^* (\varepsilon + \xi_i^* - y_i + \langle w, x_i \rangle + b),
\end{aligned} \tag{9.12}$$

where the dual variables (or Lagrange multipliers) in (9.12) have to satisfy positivity constraints,

$$\alpha_i^{(*)}, \eta_i^{(*)} \ge 0. \tag{9.13}$$


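It follows from the saddle point condition (Chapter 6) that the partial derivatives of $L$ with respect to the primal variables $(w, b, \xi_i, \xi_i^*)$ have to vanish for optimality;

$$\begin{aligned}
\partial_b L &= \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, && (9.14)\\
\partial_w L &= w - \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) x_i = 0, && (9.15)\\
\partial_{\xi_i^{(*)}} L &= \frac{C}{m} - \alpha_i^{(*)} - \eta_i^{(*)} = 0. && (9.16)
\end{aligned}$$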

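Substituting (9.14), (9.15), and (9.16) into (9.12) yields the dual optimization problem,

$$\begin{aligned}
\underset{\boldsymbol{\alpha}, \boldsymbol{\alpha}^* \in \mathbb{R}^m}{\text{maximize}} \quad & -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} y_i (\alpha_i^* - \alpha_i), \\
\text{subject to} \quad & \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C/m].
\end{aligned} \tag{9.17}$$

In deriving (9.17), we eliminate the dual variables $\eta_i, \eta_i^*$ through condition (9.16). Eq. (9.15) can be rewritten as

$$w = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) x_i, \quad \text{thus} \quad f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) \langle x_i, x \rangle + b. \tag{9.18}$$

This is the familiar SV expansion, stating that $w$ can be completely described as a linear combination of a subset of the training patterns $x_i$.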

Note that just as in the pattern recognition case, the complete algorithm can be 
described in terms of dot products between the data. Even when evaluating f(x), 
we need not compute w explicitly. This will allow the formulation of a nonlinear 
extension using kernels. 
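To illustrate that the dual only needs dot products between the data, here is a small sketch that evaluates the dual objective of (9.17) and checks its constraints for a candidate pair of multiplier vectors. The function names and the use of NumPy are assumptions for illustration; no optimization is performed.

```python
import numpy as np

def eps_svr_dual_objective(alpha, alpha_star, X, y, eps):
    """Value of the dual objective in (9.17) for the linear dot product."""
    alpha, alpha_star, X, y = (np.asarray(a, dtype=float)
                               for a in (alpha, alpha_star, X, y))
    d = alpha_star - alpha                 # expansion coefficients of w, cf. (9.18)
    K = X @ X.T                            # Gram matrix of dot products <x_i, x_j>
    return -0.5 * d @ K @ d - eps * np.sum(alpha_star + alpha) + y @ d

def is_dual_feasible(alpha, alpha_star, C, tol=1e-10):
    """Check the equality constraint and the box constraints of (9.17)."""
    alpha, alpha_star = np.asarray(alpha, float), np.asarray(alpha_star, float)
    upper = C / len(alpha)
    in_box = np.all((alpha >= -tol) & (alpha <= upper + tol)
                    & (alpha_star >= -tol) & (alpha_star <= upper + tol))
    return bool(in_box and abs(np.sum(alpha - alpha_star)) <= tol)
```

Replacing `X @ X.T` by a kernel matrix is the only change needed for the nonlinear case.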

So far we have neglected the issue of computing $b$. The latter can be done by exploiting the Karush-Kuhn-Tucker (KKT) conditions (Chapter 6). These state that at the point of the solution, the product between dual variables and constraints has to vanish;

$$\begin{aligned}
\alpha_i (\varepsilon + \xi_i + y_i - \langle w, x_i \rangle - b) &= 0, \\
\alpha_i^* (\varepsilon + \xi_i^* - y_i + \langle w, x_i \rangle + b) &= 0,
\end{aligned} \tag{9.19}$$

and

$$\begin{aligned}
\left(\tfrac{C}{m} - \alpha_i\right) \xi_i &= 0, \\
\left(\tfrac{C}{m} - \alpha_i^*\right) \xi_i^* &= 0.
\end{aligned} \tag{9.20}$$

This allows us to draw several useful conclusions.

• First, only examples $(x_i, y_i)$ with corresponding $\alpha_i^{(*)} = C/m$ can lie outside the $\varepsilon$-insensitive tube (i.e., $\xi_i^{(*)} > 0$) around $f$.

• Second, we have $\alpha_i \alpha_i^* = 0$. In other words, there can never be a set of dual variables $\alpha_i, \alpha_i^*$ which are both simultaneously nonzero (cf. Problem 9.1).

• Third, for $\alpha_i^{(*)} \in (0, C/m)$ we have $\xi_i^{(*)} = 0$, and furthermore the second factor in (9.19) must vanish. Hence $b$ can be computed as follows:

$$\begin{aligned}
b &= y_i - \langle w, x_i \rangle + \varepsilon \quad \text{for } \alpha_i \in (0, C/m), \\
b &= y_i - \langle w, x_i \rangle - \varepsilon \quad \text{for } \alpha_i^* \in (0, C/m).
\end{aligned} \tag{9.21}$$

Theoretically, it suffices to use any Lagrange multiplier in $(0, C/m)$. If given the choice between several such multipliers (usually there are many multipliers which are not 'at bound,' meaning that they do not equal $0$ or $C/m$), it is safest to use one that is not too close to $0$ or $C/m$.

Another way of computing b will be discussed in the context of interior point opti- 
mization (cf. Chapter 10). There, b turns out to be a by-product of the optimization 
process. See also [291] for further methods to compute the constant offset. 


• A final note must be made regarding the sparsity of the SV expansion. From (9.19) it follows that the Lagrange multipliers may be nonzero only for $|f(x_i) - y_i| \ge \varepsilon$; in other words, for all examples inside the $\varepsilon$-tube (the shaded region in Figure 1.8) the $\alpha_i, \alpha_i^*$ vanish. This is because when $|f(x_i) - y_i| < \varepsilon$ the second factor in (9.19) is nonzero, hence $\alpha_i, \alpha_i^*$ must be zero for the KKT conditions to be satisfied. Therefore we have a sparse expansion of $w$ in terms of $x_i$ (we do not need all $x_i$ to describe $w$). The examples that come with nonvanishing coefficients are called Support Vectors. It is geometrically plausible that the points inside the tube do not contribute to the solution: we could remove any one of them, and still obtain the same solution, therefore they cannot carry any information about it.
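The computation of $b$ and the sparsity of the expansion can be sketched as follows. The sketch assumes that `alpha` holds the multipliers of constraint (9.9) and `alpha_star` those of (9.10), as in the Lagrangian (9.12). A non-bound `alpha[i]` then means $f(x_i) - y_i = \varepsilon$, and a non-bound `alpha_star[i]` means $y_i - f(x_i) = \varepsilon$. Averaging over all non-bound multipliers is our choice for numerical stability, not something prescribed by the text, and the function name is hypothetical.

```python
import numpy as np

def svr_offset(alpha, alpha_star, X, y, eps, C, tol=1e-8):
    """Estimate b from non-bound multipliers; also return w and the SV mask."""
    alpha, alpha_star = np.asarray(alpha, float), np.asarray(alpha_star, float)
    X, y = np.asarray(X, float), np.asarray(y, float)
    upper = C / len(y)
    w = (alpha_star - alpha) @ X                 # SV expansion (9.18), linear case
    wx = X @ w                                   # <w, x_i>, offset not included
    free = (alpha > tol) & (alpha < upper - tol)                 # (9.9) active, xi_i = 0
    free_star = (alpha_star > tol) & (alpha_star < upper - tol)  # (9.10) active, xi_i^* = 0
    candidates = np.concatenate([
        y[free] - wx[free] + eps,                # from f(x_i) - y_i = eps
        y[free_star] - wx[free_star] - eps,      # from y_i - f(x_i) = eps
    ])
    if candidates.size == 0:
        raise ValueError("no non-bound multipliers; b cannot be read off this way")
    support = (alpha + alpha_star) > tol         # patterns with nonvanishing coefficients
    return candidates.mean(), w, support
```

Patterns where `support` is false lie strictly inside the tube and can be dropped without changing `w`.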


9.2.2 More General Loss Functions 


We will now consider loss functions c(x, y, f(x)) which for fixed x and y are convex 
in f(x). This requirement is chosen as we want to ensure the existence and unique- 
ness (for strict convexity) of a minimum of optimization problems (Chapter 6). 
Further detail on loss functions can be found in Chapter 3; for now, we will focus 
on how, given a loss function, the optimization problems are derived. 

For the sake of simplicity, we will additionally assume $c$ to be symmetric, to have (at most) two (for symmetry) discontinuities at $\pm\varepsilon$, $\varepsilon \ge 0$ in the first derivative, and to be zero in the interval $[-\varepsilon, \varepsilon]$. All loss functions from Table 3.1 belong to this class. Hence $c$ will take on the form

$$c(x, y, f(x)) = \tilde{c}(|y - f(x)|_\varepsilon). \tag{9.22}$$


Note the similarity to Vapnik's $\varepsilon$-insensitive loss. It is rather straightforward to extend this special choice to more general convex loss functions: for nonzero loss functions in the interval $[-\varepsilon, \varepsilon]$, we use an additional pair of slack variables. Furthermore we might choose different loss functions $\tilde{c}_i, \tilde{c}_i^*$ and different values of $\varepsilon_i, \varepsilon_i^*$ for each example. At the expense of additional Lagrange multipliers in the dual formulation, additional discontinuities can also be dealt with. In a manner analogous to (9.8), we arrive at a convex minimization problem [512] (note that for ease of notation, we have absorbed the sample size in $C$; cf. (9.8)):

$$\begin{aligned}
\underset{w \in \mathcal{H},\ \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ b \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \left(\tilde{c}(\xi_i) + \tilde{c}(\xi_i^*)\right), \\
\text{subject to} \quad & \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i, \\
& y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \ge 0.
\end{aligned} \tag{9.23}$$


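Again, by standard Lagrange multiplier techniques, using exactly the same reasoning as in the $|\cdot|_\varepsilon$ case, we can compute the dual optimization problem. In some places, we will omit the indices $i$ and $*$ to avoid tedious notation. This yields

$$\begin{aligned}
\underset{\boldsymbol{\alpha}^{(*)} \in \mathbb{R}^m}{\text{maximize}} \quad & -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \langle x_i, x_j \rangle \\
& + \sum_{i=1}^{m} \left( y_i (\alpha_i^* - \alpha_i) - \varepsilon (\alpha_i^* + \alpha_i) \right) + C \sum_{i=1}^{m} \left( T(\xi_i) + T(\xi_i^*) \right), \\
\text{where} \quad & w = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) x_i, \qquad T(\xi) := \tilde{c}(\xi) - \xi\, \partial_\xi \tilde{c}(\xi), \\
\text{subject to} \quad & \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, \qquad \alpha \le C\, \partial_\xi \tilde{c}(\xi), \qquad \alpha, \xi \ge 0.
\end{aligned} \tag{9.24}$$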


Let us consider the examples of Table 3.1 as special cases. We will explicitly show for two examples how (9.24) can be further simplified to reduce it to a form that is practically useful. In the $\varepsilon$-insensitive case, where $\tilde{c}(\xi) = |\xi|$, we get

$$T(\xi) = \xi - \xi \cdot 1 = 0. \tag{9.25}$$

We can further conclude from $\partial_\xi \tilde{c}(\xi) = 1$ that

$$\xi = \inf\{\xi \mid C \ge \alpha\} = 0 \quad \text{and} \quad \alpha \in [0, C]. \tag{9.26}$$

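In the case of piecewise polynomial loss, we have to distinguish two different cases: $\xi \le \sigma$ and $\xi > \sigma$. In the first case we get

$$T(\xi) = \frac{1}{p \sigma^{p-1}} \xi^p - \frac{1}{\sigma^{p-1}} \xi^p = -\frac{p-1}{p} \sigma^{1-p} \xi^p, \tag{9.27}$$

and $\xi = \inf\{\xi \mid C \sigma^{1-p} \xi^{p-1} \ge \alpha\} = \sigma C^{-\frac{1}{p-1}} \alpha^{\frac{1}{p-1}}$; thus

$$T(\xi) = -\frac{p-1}{p} \sigma C^{-\frac{p}{p-1}} \alpha^{\frac{p}{p-1}}. \tag{9.28}$$

In the second case ($\xi \ge \sigma$) we have

$$T(\xi) = \xi - \sigma \frac{p-1}{p} - \xi = -\sigma \frac{p-1}{p}, \tag{9.29}$$

and $\xi = \inf\{\xi \mid C \ge \alpha\} = \sigma$; hence $\alpha \in [0, C]$. These two cases can be combined to yield

$$\alpha \in [0, C] \quad \text{and} \quad T(\alpha) = -\frac{p-1}{p} \sigma C^{-\frac{p}{p-1}} \alpha^{\frac{p}{p-1}}. \tag{9.30}$$

Table 9.1 Terms of the convex optimization problem depending on the choice of the loss function.

Loss function                   $\varepsilon$             $\alpha$                 $C\,T(\alpha)$
$\varepsilon$-insensitive       $\varepsilon \ne 0$       $\alpha \in [0, C]$      $0$
Laplacian                       $\varepsilon = 0$         $\alpha \in [0, C]$      $0$
Huber's robust loss             $\varepsilon = 0$         $\alpha \in [0, C]$      $-\frac{1}{2} \sigma C^{-1} \alpha^2$
Piecewise polynomial            $\varepsilon = 0$         $\alpha \in [0, C]$      $-\frac{p-1}{p} \sigma C^{-\frac{1}{p-1}} \alpha^{\frac{p}{p-1}}$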


Table 9.1 contains a summary of the various conditions on $\alpha$, and formulas for $T(\alpha)$ for different loss functions.³ Note that the maximum slope of $\tilde{c}$ determines the region of feasibility of $\alpha$, meaning that $s := \sup_{\xi \in \mathbb{R}^+} \partial_\xi \tilde{c}(\xi) < \infty$ leads to compact intervals $[0, Cs]$ for $\alpha$. This means that the influence of a single pattern is bounded, leading to robust estimators (cf. Chapter 3 and Proposition 9.4 below). We also observe experimentally that the performance of an SVM depends on the loss function used [376, 515, 95].
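The closed form (9.30) can be checked numerically against the definition $T(\xi) = \tilde{c}(\xi) - \xi\, \partial_\xi \tilde{c}(\xi)$. The sketch below assumes the piecewise polynomial loss in the form $\tilde{c}(\xi) = \xi^p / (p \sigma^{p-1})$ for $\xi \le \sigma$ and $\tilde{c}(\xi) = \xi - \sigma (p-1)/p$ otherwise, which is what (9.27) and (9.29) imply; the function names are hypothetical.

```python
def piecewise_poly_loss(xi, sigma, p):
    """Assumed loss: degree-p polynomial up to sigma, linear beyond (xi >= 0)."""
    if xi <= sigma:
        return xi ** p / (p * sigma ** (p - 1))
    return xi - sigma * (p - 1) / p

def piecewise_poly_slope(xi, sigma, p):
    """Derivative of the loss above with respect to xi."""
    return (xi / sigma) ** (p - 1) if xi <= sigma else 1.0

def T_from_definition(alpha, C, sigma, p):
    """T evaluated at the smallest xi with C * slope(xi) >= alpha, 0 <= alpha <= C."""
    xi = sigma * (alpha / C) ** (1.0 / (p - 1))
    return piecewise_poly_loss(xi, sigma, p) - xi * piecewise_poly_slope(xi, sigma, p)

def T_closed_form(alpha, C, sigma, p):
    """Right-hand side of (9.30)."""
    q = p / (p - 1)
    return -(p - 1) / p * sigma * C ** (-q) * alpha ** q
```

For $p = 2$ the loss is Huber's robust loss, and multiplying `T_closed_form` by $C$ reproduces the corresponding entry $-\frac{1}{2} \sigma C^{-1} \alpha^2$.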

A cautionary remark is necessary regarding the use of loss functions other than the $\varepsilon$-insensitive loss. Unless $\varepsilon \ne 0$, we lose the advantage of a sparse decomposition. This may be acceptable in the case of few data, but will render the prediction step rather slow otherwise. Hence we have to trade off a potential loss in prediction accuracy with faster predictions. Note, however, that this issue could be addressed using reduced set algorithms like those described in Chapter 18, or sparse decomposition techniques [513]. In a Bayesian setting, Tipping [539] recently showed how the squared loss function can be used without sacrificing sparsity, cf. Section 16.6.


3. The table displays $C\,T(\alpha)$ instead of $T(\alpha)$, since the former can be plugged directly into the corresponding optimization equations.

Figure 9.2 Architecture of a regression machine constructed using the SV algorithm. In typical regression applications, the inputs would not be visual patterns. Nevertheless, in this example, the inputs are depicted as handwritten digits.


9.2.3 The Bigger Picture 


Let us briefly review the basic properties of the SV algorithm for regression, as described so far. Figure 9.2 contains a graphical overview of the different steps in the regression stage. The input pattern (for which a prediction is to be made) is mapped into feature space by a map $\Phi$. Then dot products are computed with the images of the training patterns under the map $\Phi$. This corresponds to evaluating kernel functions $k(x_i, x)$. Finally, the dot products are added up using the weights $\upsilon_i = \alpha_i^* - \alpha_i$. This, plus the constant term $b$, yields the final prediction output. The process described here is very similar to regression in a neural network, with the difference that in the SV case, the weights in the input layer are a subset of the training patterns.
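The prediction pipeline just described can be sketched in a few lines. The example assumes the Gaussian RBF kernel $k(x, x') = \exp(-|x - x'|^2)$ used in the experiments of this chapter; the function names and the tolerance for treating a weight as zero are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z):
    """k(x, z) = exp(-|x - z|^2)."""
    diff = np.asarray(x, float) - np.asarray(z, float)
    return np.exp(-np.dot(diff, diff))

def sv_predict(x, train_X, alpha, alpha_star, b, kernel=rbf_kernel, tol=1e-12):
    """Weighted sum of kernel evaluations plus the constant term b."""
    weights = np.asarray(alpha_star, float) - np.asarray(alpha, float)
    total = b
    for weight, x_i in zip(weights, train_X):
        if abs(weight) > tol:            # only Support Vectors contribute
            total += weight * kernel(x_i, x)
    return total
```

The loop touches only patterns with nonzero weight, so the cost of a prediction scales with the number of SVs rather than with the size of the training set.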

The toy example in Figure 9.3 demonstrates how the SV algorithm chooses 
the flattest function among those approximating the original data with a given 
precision. Although requiring flatness only in feature space, we observe that the 
functions are also smooth in input space. This is due to the fact that kernels can be 
associated with smoothness properties via regularization operators, as explained 
in more detail in Chapter 4. 

Finally, Figure 9.4 shows the relation between approximation quality and sparsity of representation in the SV case. The lower the precision required for approximating the original data, the fewer SVs are needed to encode this data. The non-SVs are redundant; even without these patterns in the training set, the SVM would have constructed exactly the same function $f$. We might be tempted to use this property as an efficient means of data compression, namely by storing only the support patterns, from which the estimate can be reconstructed completely. Unfortunately, this approach turns out not to work well in the case of noisy high-dimensional data, since for moderate approximation quality, the number of SVs can be rather high [572].

Figure 9.3 From top to bottom: approximation of the function $\operatorname{sinc} x$ with precisions $\varepsilon = 0.1, 0.2$, and $0.5$. The solid top and dashed bottom lines indicate the size of the $\varepsilon$-tube, here drawn around the target function $\operatorname{sinc} x$. The dotted line between them is the regression function.

Figure 9.4 Left to right: regression (solid line), data points (small dots) and SVs (big dots) for an approximation of $\operatorname{sinc} x$ (dotted line) with $\varepsilon = 0.1, 0.2$, and $0.5$. Note the decrease in the number of SVs.


9.3 $\nu$-SV Regression


The parameter $\varepsilon$ of the $\varepsilon$-insensitive loss is useful if the desired accuracy of the approximation can be specified beforehand. In some cases, however, we just want the estimate to be as accurate as possible, without having to commit ourselves to a specific level of accuracy a priori. We now describe a modification of the $\varepsilon$-SVR algorithm, called $\nu$-SVR, which automatically computes $\varepsilon$ [481].

To estimate functions (9.2) from empirical data (9.3) we proceed as follows. At each point $x_i$, we allow an error $\varepsilon$. Everything above $\varepsilon$ is captured in slack variables $\xi_i^{(*)}$, which are penalized in the objective function via a regularization constant $C$, chosen a priori. The size of $\varepsilon$ is traded off against model complexity and slack variables via a constant $\nu \ge 0$:

$$\begin{aligned}
\underset{w \in \mathcal{H},\ \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ \varepsilon, b \in \mathbb{R}}{\text{minimize}} \quad & \tau(w, \boldsymbol{\xi}^{(*)}, \varepsilon) = \frac{1}{2}\|w\|^2 + C \cdot \left( \nu \varepsilon + \frac{1}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) \right), && (9.31)\\
\text{subject to} \quad & (\langle w, x_i \rangle + b) - y_i \le \varepsilon + \xi_i, && (9.32)\\
& y_i - (\langle w, x_i \rangle + b) \le \varepsilon + \xi_i^*, && (9.33)\\
& \xi_i^{(*)} \ge 0, \quad \varepsilon \ge 0. && (9.34)
\end{aligned}$$


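For the constraints, we introduce multipliers $\alpha_i^{(*)}, \eta_i^{(*)}, \beta \ge 0$, and obtain the Lagrangian,

$$\begin{aligned}
L(w, b, \boldsymbol{\alpha}^{(*)}, \beta, \boldsymbol{\xi}^{(*)}, \varepsilon, \boldsymbol{\eta}^{(*)}) = {} & \frac{1}{2}\|w\|^2 + C \nu \varepsilon + \frac{C}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) - \beta \varepsilon - \sum_{i=1}^{m} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\
& - \sum_{i=1}^{m} \alpha_i (\xi_i + y_i - \langle w, x_i \rangle - b + \varepsilon) - \sum_{i=1}^{m} \alpha_i^* (\xi_i^* + \langle w, x_i \rangle + b - y_i + \varepsilon).
\end{aligned} \tag{9.35}$$

To minimize (9.31), we have to find the saddle point of $L$, meaning that we minimize over the primal variables $w, \varepsilon, b, \xi_i^{(*)}$ and maximize over the dual variables $\alpha_i^{(*)}, \beta, \eta_i^{(*)}$. Setting the derivatives with respect to the primal variables equal to zero yields the four equations

$$\begin{aligned}
& w = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) x_i, && (9.36)\\
& C \cdot \nu - \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) - \beta = 0, && (9.37)\\
& \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, && (9.38)\\
& \frac{C}{m} - \alpha_i^{(*)} - \eta_i^{(*)} = 0. && (9.39)
\end{aligned}$$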


As in Section 9.2, the $\alpha_i^{(*)}$ are nonzero in the SV expansion (9.36) only when a constraint (9.32) or (9.33) is precisely met.

Substituting the above four conditions into $L$ leads to the dual optimization problem (sometimes called the Wolfe dual). We will state it in the kernelized form: as usual, we substitute a kernel $k$ for the dot product, corresponding to a dot product in some feature space related to input space via a nonlinear map $\Phi$,

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle = \langle \mathbf{x}, \mathbf{x}' \rangle. \tag{9.40}$$

Rewriting the constraints, and noting that $\beta, \eta_i^{(*)} \ge 0$ do not appear in the dual, we arrive at the $\nu$-SVR Optimization Problem: for $\nu \ge 0$, $C > 0$,

$$\begin{aligned}
\underset{\boldsymbol{\alpha}^{(*)} \in \mathbb{R}^m}{\text{maximize}} \quad & W(\boldsymbol{\alpha}^{(*)}) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j), && (9.41)\\
\text{subject to} \quad & \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, && (9.42)\\
& \alpha_i^{(*)} \in \left[0, \tfrac{C}{m}\right], && (9.43)\\
& \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) \le C \cdot \nu. && (9.44)
\end{aligned}$$
The regression estimate then takes the form (cf. (9.2), (9.36), (9.40))

$$f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) k(x_i, x) + b, \tag{9.45}$$

where $b$ (and $\varepsilon$) can be computed by taking into account that (9.32) and (9.33) become equalities with $\xi_i^{(*)} = 0$ for points with $0 < \alpha_i^{(*)} < C/m$, due to the KKT conditions. Here, substitution of $\sum_j (\alpha_j^* - \alpha_j) k(x_j, x)$ for $\langle w, x \rangle$ is understood, cf. (9.36), (9.40). Geometrically, this amounts to saying that we can compute the thickness and vertical position of the tube by considering some points that sit exactly on the edge of the tube.⁴

We now show that $\nu$ has an interpretation similar to the case of $\nu$-SV pattern recognition (Section 7.5). This is not completely obvious: recall that in the case of pattern recognition, we introduced $\nu$ to replace $C$. In regression, on the other hand, we introduced it to replace $\varepsilon$.

Before we give the result, the following observation concerning $\varepsilon$ is helpful. If $\nu > 1$, then necessarily $\varepsilon = 0$, since it does not pay to increase $\varepsilon$. This can be seen either from (9.31) (the slacks are "cheaper") or by noting that for $\nu \ge 1$, (9.43) implies (9.44), since $\alpha_i \alpha_i^* = 0$ for all $i$ (9.58). Therefore, (9.44) is redundant, and all values $\nu \ge 1$ are actually equivalent. Hence, we restrict ourselves to $0 \le \nu \le 1$.
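The redundancy of (9.44) for $\nu \ge 1$ can be spelled out in one line. The step below uses only (9.43) and the fact that at most one multiplier of each pair $\alpha_i, \alpha_i^*$ is nonzero:

```latex
\sum_{i=1}^{m} (\alpha_i + \alpha_i^*)
  = \sum_{i=1}^{m} \max\{\alpha_i, \alpha_i^*\}
  \le m \cdot \frac{C}{m}
  = C
  \le C \cdot \nu
  \qquad \text{for } \nu \ge 1 .
```

Any point satisfying the box constraints thus satisfies (9.44) automatically.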

If $\nu \le 1$, we mostly find $\varepsilon > 0$. It is still possible that $\varepsilon = 0$, for instance if the data are noise-free and can be perfectly interpolated with a low capacity model. The case $\varepsilon = 0$ is not what we are interested in: it corresponds to plain $L_1$-loss regression.

Below, we will use the term errors to refer to training points lying outside the tube, and the term fraction of errors/SVs to denote the relative numbers of errors/SVs; that is, these respective quantities are divided by $m$. In this proposition, we define the modulus of absolute continuity of a function $f$ as the function $\epsilon(\delta) := \sup \sum_i |f(b_i) - f(a_i)|$, where the supremum is taken over all disjoint intervals $(a_i, b_i)$ with $a_i < b_i$ satisfying $\sum_i (b_i - a_i) < \delta$. Loosely speaking, the condition on the conditional density of $y$ given $x$ asks that it be absolutely continuous 'on average.'


Proposition 9.2 Suppose $\nu$-SVR is applied to some data set and the resulting $\varepsilon$ is nonzero. The following statements hold:

(i) $\nu$ is an upper bound on the fraction of errors.

(ii) $\nu$ is a lower bound on the fraction of SVs.

(iii) Suppose the data (9.3) were generated iid from a distribution $P(x, y) = P(x) P(y|x)$, with $P(y|x)$ continuous and the expectation of the modulus of absolute continuity of its density satisfying $\lim_{\delta \to 0} \mathbf{E}[\epsilon(\delta)] = 0$. With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of errors.


Figure 9.5 Graphical depiction of the $\nu$-trick. Imagine increasing $\varepsilon$, starting from 0. The first term in $\nu \varepsilon + \frac{1}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*)$ (cf. (9.31)) increases proportionally to $\nu$, while the second term decreases proportionally to the fraction of points outside of the tube. Hence, $\varepsilon$ grows as long as the latter fraction is larger than $\nu$. At the optimum, it must therefore be $\le \nu$ (Proposition 9.2, (i)). Next, imagine decreasing $\varepsilon$, starting from some large value. Again, the change in the first term is proportional to $\nu$, but this time, the change in the second term is proportional to the fraction of SVs (even points on the edge of the tube contribute). Hence, $\varepsilon$ shrinks as long as the fraction of SVs is smaller than $\nu$, leading eventually to Proposition 9.2, (ii).

4. Should it occur, for instance due to numerical problems, that it is impossible to find two non-bound SVs at the two edges of the tube, then we can replace them by the SVs which are closest to the tube. The SV closest to the top of the tube can be found by minimizing $y_i - \langle w, x_i \rangle$ over all points with $\alpha_i^* > 0$; similarly, for the bottom SVs we minimize $\langle w, x_i \rangle - y_i$ over the points with $\alpha_i > 0$. We then proceed as we would with the non-bound SVs, cf. Problem 9.16.


The proposition shows that $0 \le \nu \le 1$ can be used to control the number of errors. Since the constraint (9.42) implies that (9.44) is equivalent to $\sum_i \alpha_i^{(*)} \le C \nu / 2$, we conclude that Proposition 9.2 actually holds separately for the upper and the lower edge of the tube, with $\nu / 2$ each. As an aside, note that by the same argument, the number of SVs at the two edges of the standard $\varepsilon$-SVR tube asymptotically agree.

The proof of Proposition 9.2 can be found in Section A.2. In its stead, we use a graphical argument that should make the result plausible (Figure 9.5). For further information on the $\nu$-trick in a more general setting, cf. Section 3.4.3.
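The graphical argument can be mimicked numerically under a strong simplification: the sketch below holds the regression function fixed, so that only the residuals $r_i = y_i - f(x_i)$ matter, and minimizes $\nu \varepsilon + \frac{1}{m} \sum_i \max(0, |r_i| - \varepsilon)$ over $\varepsilon \ge 0$ alone. This is not the full $\nu$-SVR problem, and the function name is hypothetical, but it shows how the optimal tube width brackets $\nu$ between the fraction of points strictly outside the tube and the fraction on or outside it.

```python
def best_tube_width(residuals, nu):
    """Minimize nu*eps + mean(max(0, |r_i| - eps)) over eps >= 0 for fixed residuals."""
    abs_r = sorted(abs(r) for r in residuals)
    m = len(abs_r)

    def objective(eps):
        return nu * eps + sum(max(0.0, a - eps) for a in abs_r) / m

    # The objective is convex and piecewise linear with kinks at 0 and at the |r_i|,
    # so a minimizer can be found among these candidate values.
    eps = min([0.0] + abs_r, key=objective)
    frac_errors = sum(a > eps for a in abs_r) / m    # strictly outside the tube
    frac_on_or_out = sum(a >= eps for a in abs_r) / m  # on the edge or outside
    return eps, frac_errors, frac_on_or_out
```

With ten residuals of magnitude $0.1, 0.2, \ldots, 1.0$ and $\nu = 0.25$, the minimizer is $\varepsilon = 0.8$: two points (a fraction $0.2 \le \nu$) lie outside the tube, and three (a fraction $0.3 \ge \nu$) lie on or outside it.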

Let us briefly discuss how $\nu$-SVR relates to $\varepsilon$-SVR (Section 9.1). Both algorithms use the $\varepsilon$-insensitive loss function, but $\nu$-SVR automatically computes $\varepsilon$. From a Bayesian viewpoint, this automatic adaptation of the loss function can be interpreted as adapting the error model, controlled by the hyperparameter $\nu$ (cf. Chapter 3). Comparing (9.17) (substitution of a kernel for the dot product is understood) and (9.41), we note that $\varepsilon$-SVR requires an additional term $-\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i)$, which, for fixed $\varepsilon > 0$, encourages some of the $\alpha_i^{(*)}$ to be 0. Accordingly, the constraint (9.44), which appears in $\nu$-SVR, is not needed. The primal problems (9.8) and (9.31) differ in the term $\nu \varepsilon$. If $\nu = 0$, then the optimization can grow $\varepsilon$ arbitrarily large, hence zero empirical risk can be obtained even when all $\alpha$ are zero.

In the following sense, $\nu$-SVR includes $\varepsilon$-SVR. Note that in the general case, using kernels, $w$ is a vector in feature space.


Proposition 9.3 If $\nu$-SVR leads to the solution $\bar{\varepsilon}, \bar{w}, \bar{b}$, then $\varepsilon$-SVR with $\varepsilon$ set a priori to $\bar{\varepsilon}$, and the same value of $C$, has the solution $\bar{w}, \bar{b}$.

Proof If we minimize (9.31), then fix $\varepsilon$ and minimize only over the remaining variables, the solution does not change. ∎


Using the $\varepsilon$-insensitive loss function, only the patterns outside the $\varepsilon$-tube enter the empirical risk term, whereas the patterns closest to the actual regression have zero loss. This does not mean that it is only the 'outliers' that determine the regression. In fact, the contrary is the case:


Proposition 9.4 (Resistance of SV Regression) Using Support Vector Regression with the $\varepsilon$-insensitive loss function (9.1), local movements of target values of points outside the tube do not influence the regression.


Proof Shifting $y_i$ locally does not change the status of $(x_i, y_i)$ as being a point outside the tube. The dual solution $\boldsymbol{\alpha}^{(*)}$ then remains feasible; which is to say it satisfies the constraints (the point still has $\alpha_i^{(*)} = C/m$). In addition, the primal solution, with $\xi_i$ transformed according to the movement of $y_i$, is also feasible. Finally, the KKT conditions are still satisfied, as $\alpha_i^{(*)} = C/m$. Thus (Chapter 6), $\boldsymbol{\alpha}^{(*)}$ remains the solution. ∎


The proof relies on the fact that everywhere outside the tube, the upper bound on the $\alpha_i^{(*)}$ is the same. This, in turn, is precisely the case if the loss function increases linearly outside the $\varepsilon$-tube (cf. Chapter 3 for requirements for robust loss functions). Inside, a range of functions is permissible, provided their first derivative is smaller than that of the linear part.

In the case of $\nu$-SVR with $\varepsilon$-insensitive loss, the above proposition implies that essentially, the regression is a generalization of an estimator for the mean of a random variable which

(a) throws away the largest and smallest examples (a fraction $\nu / 2$ of either category; in Section 9.3, it is shown that the sum constraint (9.42) implies that Proposition 9.2 can be applied separately for the two sides, using $\nu / 2$); and

(b) estimates the mean by taking the average of the two extremal ones of the remaining examples.
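The estimator described in (a) and (b) can be written down directly for the special case of estimating a location (a constant function). This is only a sketch of the idea: the rounding rule used to turn the fraction $\nu / 2$ into a number of discarded points is our assumption, and the function name is hypothetical.

```python
import math

def nu_location_estimate(sample, nu):
    """Discard a fraction nu/2 at each end, then average the two extremal survivors."""
    ys = sorted(sample)
    m = len(ys)
    k = math.floor(nu * m / 2)       # assumed number of points discarded at each end
    if 2 * k >= m:
        raise ValueError("nu is too large for this sample size")
    kept = ys[k:m - k]
    center = (kept[0] + kept[-1]) / 2        # plays the role of the estimate
    half_width = (kept[-1] - kept[0]) / 2    # plays the role of the tube width
    return center, half_width
```

Moving a discarded point further out (for example replacing 100 by 1000 in the sample `[0, 1, ..., 8, 100]` with `nu = 0.2`) leaves the result unchanged, in line with Proposition 9.4.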


This resistance to outliers is close in spirit to robust estimators like the trimmed mean. In fact, we could get closer to the idea of the trimmed mean, which first throws away the largest and smallest points and then computes the mean of the remaining points, by using a quadratic loss inside the $\varepsilon$-tube. This would leave us with Huber's robust loss function (see Table 3.1).

Note, moreover, that the parameter $\nu$ is related to the breakdown point of the corresponding robust estimator [251]. As it specifies the fraction of points which may be arbitrarily bad outliers, $\nu$ is related to the fraction of some arbitrary distribution that may be added to a known noise model without leading to a failure of the estimator.

Finally, we add that by a simple modification of the loss function (cf. [594]), namely weighting the slack variables $\boldsymbol{\xi}^{(*)}$ above and below the tube in the target function (9.31) by $2\lambda$ and $2(1 - \lambda)$ respectively, with $\lambda \in [0, 1]$, we can estimate generalized quantiles. The argument proceeds as follows. Asymptotically, all patterns have multipliers at bound (cf. Proposition 9.2). The parameter $\lambda$, however, changes the upper bounds in the box constraints applying to the two different types of slack variables to $2C\lambda / m$ and $2C(1 - \lambda) / m$, respectively. The equality constraint (9.38) then implies that $(1 - \lambda)$ and $\lambda$ give the fractions of points (of those which are outside the tube) which lie on the top and bottom of the tube, respectively.

Table 9.2 Asymptotic behavior of the fraction of errors and SVs. The $\varepsilon$ found by $\nu$-SV regression is largely independent of the sample size $m$. The fraction of SVs and the fraction of errors approach $\nu = 0.2$ from above and below, respectively, as the number of training examples $m$ increases (cf. Proposition 9.2).

Figure 9.6 $\nu$-SV regression with $\nu = 0.2$ (left) and $\nu = 0.8$ (right). The larger $\nu$ allows more points to lie outside the tube (see Section 9.3). The algorithm automatically adjusts $\varepsilon$ to 0.22 (left) and 0.04 (right). Shown are the sinc function (dotted), the regression $f$ and the tube $f \pm \varepsilon$.

Let us now look at some experiments. We start with a toy example, which involves estimating the regression of a noisy sinc function, given $m$ examples $(x_i, y_i)$, with $x_i$ drawn uniformly from $[-3, 3]$, and $y_i = \sin(\pi x_i) / (\pi x_i) + \upsilon_i$. The $\upsilon_i$ were drawn from a Gaussian with zero mean and variance $\sigma^2$, and we used the RBF kernel $k(x, x') = \exp(-|x - x'|^2)$, $m = 50$, $C = 100$, $\nu = 0.2$, and $\sigma = 0.2$. Standard deviation error bars were computed from 100 trials. Finally, the risk (or test error) of a regression estimate $f$ was computed with respect to the sinc function without noise, as $\frac{1}{6} \int_{-3}^{3} |f(x) - \sin(\pi x) / (\pi x)| \, dx$. Results are given in Table 9.2 and Figures 9.6-9.12.

Figure 9.7 $\nu$-SV regression on data with noise $\sigma = 0$ (left) and $\sigma = 1$ (right). In both cases, $\nu = 0.2$. The tube width automatically adjusts to the noise (top: $\varepsilon = 0$, bottom: $\varepsilon = 1.19$).

Figure 9.8 $\varepsilon$-SV regression (Section 9.2) on data with noise $\sigma = 0$ (left) and $\sigma = 1$ (right). In both cases, $\varepsilon = 0.2$; this choice, which has to be specified a priori, is ideal for neither case: in the upper figure, the regression estimate is biased; in the lower figure, $\varepsilon$ does not match the external noise [510].
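The data generation and the risk measure of the toy experiment described above can be sketched as follows. The helper names, the random seed, and the grid-based approximation of the integral are assumptions made for illustration; `np.sinc` is NumPy's normalized sinc, $\sin(\pi x) / (\pi x)$. The risk is approximated by averaging the absolute deviation over a uniform grid on $[-3, 3]$, which corresponds to the integral normalized by the interval length 6.

```python
import numpy as np

def make_noisy_sinc(m=50, sigma=0.2, seed=0):
    """Draw x_i uniformly from [-3, 3] and add Gaussian noise to sinc(x_i)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=m)
    y = np.sinc(x) + rng.normal(0.0, sigma, size=m)
    return x, y

def risk(f, n_grid=6001):
    """Approximate (1/6) * integral over [-3, 3] of |f(x) - sinc(x)| dx."""
    grid = np.linspace(-3.0, 3.0, n_grid)
    return float(np.mean(np.abs(f(grid) - np.sinc(grid))))
```

Any regression estimate that accepts an array of inputs can be passed to `risk`; the noise-free target itself has risk zero.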


Figure 9.9 $\nu$-SVR for different values of the error constant $\nu$. Notice how $\varepsilon$ decreases when more errors are allowed (large $\nu$), and that over a large range of $\nu$, the test error (risk) is insensitive to changes in $\nu$.

Figure 9.10 $\nu$-SVR for different values of the noise $\sigma$. The tube radius $\varepsilon$ increases linearly with $\sigma$ (largely due to the fact that both $\varepsilon$ and the $\xi_i^{(*)}$ enter the loss function linearly). Due to the automatic adaptation of $\varepsilon$, the number of SVs and of points outside the tube (errors) is largely independent of $\sigma$, except for the noise-free case $\sigma = 0$.

Figure 9.11 $\nu$-SVR for different values of the constant $C$. The left graph shows that $\varepsilon$ decreases when the regularization is decreased (large $C$). Only very little, if any, overfitting occurs. In the right graph, note that $\nu$ upper bounds the fraction of errors, and lower bounds the fraction of SVs (cf. Proposition 9.2). The bound gets looser as $C$ increases; this corresponds to a smaller number of examples $m$ relative to $C$ (cf. Table 9.2).


9.4 Convex Combinations and $\ell_1$-Norms


All the algorithms presented so far involve convex, and at best, quadratic programming. Yet we might think of reducing the problem to a case where linear programming techniques can be applied. This can be done in a straightforward fashion [591, 517] for both SV pattern recognition and regression. The key is to replace the original objective function by

$$R_{\mathrm{reg}}[f] := \frac{1}{m} \|\boldsymbol{\alpha}\|_1 + C\, R_{\mathrm{emp}}[f], \tag{9.46}$$

where $\|\boldsymbol{\alpha}\|_1 = \sum_{i=1}^{m} |\alpha_i|$ denotes the $\ell_1$ norm in coefficient space. Using the SV kernel expansion (9.18),

$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b, \tag{9.47}$$

this translates to the objective function

$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} |\alpha_i| + \frac{C}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)). \tag{9.48}$$

For the $\varepsilon$-insensitive loss function, this leads to a linear programming problem. For other loss functions, the problem remains a quadratic or general convex one. Therefore we limit ourselves to the derivation of the linear programming problem in the case of the $|\cdot|_\varepsilon$ loss function. Reformulating (9.48) yields

$$\begin{aligned}
\underset{\boldsymbol{\alpha}^{(*)}, \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ b \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \frac{C}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*), \\
\text{subject to} \quad & \sum_{j=1}^{m} (\alpha_j - \alpha_j^*) k(x_j, x_i) + b - y_i \le \varepsilon + \xi_i, \\
& y_i - \sum_{j=1}^{m} (\alpha_j - \alpha_j^*) k(x_j, x_i) - b \le \varepsilon + \xi_i^*, \\
& \alpha_i, \alpha_i^*, \xi_i, \xi_i^* \ge 0.
\end{aligned} \tag{9.49}$$


Unlike the SV case, the transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize $R_{\mathrm{reg}}[f]$ directly, which can be achieved using a linear optimizer (see [130, 336, 555]).
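To make the linear program concrete, the sketch below assembles the cost vector and inequality constraints of (9.49) in the standard form "minimize $c^\top z$ subject to $A_{ub} z \le b_{ub}$" for a symmetric kernel matrix `K`. The variable ordering and the function name are assumptions for illustration. The returned tuple matches the inputs expected by a generic LP solver such as `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)`, should SciPy be available; the sketch itself only needs NumPy.

```python
import numpy as np

def eps_lp_regression_data(K, y, eps, C):
    """LP data for (9.49); z = [alpha (m), alpha_star (m), xi (m), xi_star (m), b]."""
    K, y = np.asarray(K, float), np.asarray(y, float)
    m = len(y)
    I, Z, one = np.eye(m), np.zeros((m, m)), np.ones((m, 1))
    # Objective: (1/m) sum(alpha + alpha_star) + (C/m) sum(xi + xi_star); b is free of cost.
    c = np.concatenate([np.full(2 * m, 1.0 / m), np.full(2 * m, C / m), [0.0]])
    # sum_j (alpha_j - alpha_j^*) k(x_j, x_i) + b - y_i <= eps + xi_i
    upper = np.hstack([K, -K, -I, Z, one])
    # y_i - sum_j (alpha_j - alpha_j^*) k(x_j, x_i) - b <= eps + xi_i^*
    lower = np.hstack([-K, K, Z, -I, -one])
    A_ub = np.vstack([upper, lower])
    b_ub = np.concatenate([eps + y, eps - y])
    bounds = [(0.0, None)] * (4 * m) + [(None, None)]   # b is unconstrained in sign
    return c, A_ub, b_ub, bounds
```

The $\nu$-LP variant discussed next would add one more nonnegative variable for $\varepsilon$, with cost $C\nu$, and move $\varepsilon$ from `b_ub` into the constraint matrix.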

Weston et al. [591] use a similar LP approach to estimate densities on a line. 
We may even obtain bounds on the generalization error [505] which exhibit better 
rates than in the SV case [606]; cf. Chapter 12. 

We conclude this section by noting that we can combine these ideas with those presented in the previous section, and construct a $\nu$-LP regression algorithm [517]. It differs from the previous $\nu$-SV algorithm in that we now minimize

$$R_{\mathrm{reg}} + C \nu \varepsilon = \frac{1}{m} \sum_{i=1}^{m} |\alpha_i| + C\, R_{\mathrm{emp}}^{\varepsilon}[f] + C \nu \varepsilon. \tag{9.50}$$

The goal here is not only to achieve a small training error (with respect to $\varepsilon$), but also to obtain a solution with a small $\varepsilon$. Rewriting (9.50) as a linear program yields

$$\begin{aligned}
\underset{\boldsymbol{\alpha}^{(*)}, \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ b, \varepsilon \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \frac{C}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) + C \nu \varepsilon, \\
\text{subject to} \quad & \sum_{j=1}^{m} (\alpha_j - \alpha_j^*) k(x_j, x_i) + b - y_i \le \varepsilon + \xi_i, \\
& y_i - \sum_{j=1}^{m} (\alpha_j - \alpha_j^*) k(x_j, x_i) - b \le \varepsilon + \xi_i^*, \\
& \alpha_i, \alpha_i^*, \xi_i, \xi_i^*, \varepsilon \ge 0.
\end{aligned} \tag{9.51}$$

The difference between (9.50) and (9.49) lies in the objective function, and the fact that $\varepsilon$ has now become a variable of the optimization problem.

The $\nu$-property (Proposition 9.2) also holds for $\nu$-LP regression. The proof is analogous to the $\nu$-SV case, and can be found in [517].

Figure 9.12 $\nu$-SVR for different values of the Gaussian kernel width $2s^2$, using $k(x, x') = \exp(-|x - x'|^2 / (2s^2))$. Using a kernel that is too wide results in underfitting; moreover, since the tube becomes too rigid as $2s^2$ gets larger than 1, the $\varepsilon$ which is needed to accommodate a fraction $(1 - \nu)$ of points increases significantly. In the plot on the right, it can again be seen that the speed of the uniform convergence responsible for the asymptotic statement given in Proposition 9.2 depends on the capacity of the underlying model. Increasing the kernel width leads to smaller covering numbers (Chapter 12) and therefore faster convergence.


9.5 Parametric Insensitivity Models 


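In Section 9.3, we generalized $\varepsilon$-SVR by estimating the width of the tube rather than taking it as given a priori. What we retained, however, is the assumption that the $\varepsilon$-insensitive zone has a tube shape. We now go one step further and use parametric models of arbitrary shape [469]. This can be useful in situations where the noise depends on $x$ (this is called heteroscedastic noise).

Let $\{\zeta_q^{(*)}\}$ (here and below, $q = 1, \ldots, p$ is understood) be a set of $2p$ positive functions on the input space $\mathcal{X}$. Consider the following quadratic program: for given $\nu_1^{(*)}, \ldots, \nu_p^{(*)} \ge 0$,

$$\begin{aligned}
\underset{w \in \mathcal{H},\ \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ \boldsymbol{\varepsilon}^{(*)} \in \mathbb{R}^p,\ b \in \mathbb{R}}{\text{minimize}} \quad & \tau(w, \boldsymbol{\xi}^{(*)}, \boldsymbol{\varepsilon}^{(*)}) = \|w\|^2 / 2 + C \cdot \left( \sum_{q=1}^{p} (\nu_q \varepsilon_q + \nu_q^* \varepsilon_q^*) + \frac{1}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) \right), && (9.52)\\
\text{subject to} \quad & (\langle w, \Phi(x_i) \rangle + b) - y_i \le \sum_{q=1}^{p} \varepsilon_q \zeta_q(x_i) + \xi_i, && (9.53)\\
& y_i - (\langle w, \Phi(x_i) \rangle + b) \le \sum_{q=1}^{p} \varepsilon_q^* \zeta_q^*(x_i) + \xi_i^*, && (9.54)\\
& \xi_i^{(*)} \ge 0, \quad \varepsilon_q^{(*)} \ge 0. && (9.55)
\end{aligned}$$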
A calculation analogous to that in Section 9.3 shows that the Wolfe dual consists of maximizing (9.41) subject to (9.42), (9.43), and, instead of (9.44), the modified constraints,

$$\sum_{i=1}^{m} \alpha_i^{(*)} \zeta_q^{(*)}(x_i) \le C \cdot \nu_q^{(*)}, \tag{9.56}$$

which are still linear in $\boldsymbol{\alpha}^{(*)}$. In the toy experiment shown in Figure 9.13, we use a simplified version of this optimization problem, where we drop the term $\nu_q^* \varepsilon_q^*$ from the objective function (9.52), and use $\varepsilon_q$ and $\zeta_q$ in (9.54). By this, we render the problem symmetric with respect to the two edges of the tube. In addition, we use $p = 1$. This leads to the same Wolfe dual, except for the last constraint, which becomes (cf. (9.44))

$$\sum_{i=1}^{m} (\alpha_i + \alpha_i^*) \zeta(x_i) \le C \cdot \nu. \tag{9.57}$$

Note that the optimization problem of Section 9.3 can be recovered using the constant function $\zeta \equiv 1$.⁵
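The effect of a parametric insensitivity zone on the loss can be sketched for the symmetric case with $p = 1$. The example uses $\zeta(x) = \sin^2((2\pi/3) x)$, the function employed in the toy experiment of Figure 9.13; the function names are hypothetical. With the constant function $\zeta \equiv 1$ the expression reduces to the usual $\varepsilon$-insensitive loss.

```python
import math

def zeta_sin2(x):
    """Shape function of the toy experiment: sin^2((2*pi/3) * x)."""
    return math.sin(2.0 * math.pi / 3.0 * x) ** 2

def parametric_insensitive_loss(y, fx, x, eps, zeta=zeta_sin2):
    """Loss that is zero inside the x-dependent zone |y - f(x)| <= eps * zeta(x)."""
    return max(0.0, abs(y - fx) - eps * zeta(x))
```

Where $\zeta(x)$ is small the zone is narrow and deviations are penalized almost fully; where it is large the estimate is allowed more slack.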

The advantage of this setting is that since the same ν is used for both sides of
the tube, the computation of ε and b is straightforward: for instance, by solving
a linear system, using two conditions such as those described following (9.45).
Otherwise, general statements become cumbersome: the linear system can have
a zero determinant, depending on whether the functions $\zeta_q^{(*)}$, evaluated on the
$x_i$ with $0 < \alpha_i^{(*)} < C/m$, are linearly dependent. The latter occurs, for instance, if
we use constant functions $\zeta^{(*)} \equiv 1$. In this case, it is pointless to use two different


5. Observe the similarity to semiparametric SV models (Section 4.8) where a modification
of the expansion of f leads to similar additional constraints. The important difference in the
present setting is that the Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ are treated equally, and not with
different signs as in semiparametric modelling.

Figure 9.13 Toy example, using prior knowledge about an x-dependence of the noise.
Additive noise (σ = 1) was multiplied by the function $\sin^2((2\pi/3)x)$. Left: the same function
was used as ζ in the context of a parametric insensitivity tube (Section 9.5). Right: ν-SVR
with standard tube.


values ν, ν*, since the constraint (9.42) then implies that both sums $\sum_{i=1}^{m} \alpha_i^{(*)}$ are
bounded by C · min{ν, ν*}. We conclude this section by giving, without proof,
a generalization of Proposition 9.2 to the optimization problem with constraint
(9.57):


Proposition 9.5 Suppose we run the above algorithm on a data set with the result that
ε > 0. Then


(i) $\nu m / \sum_{i} \zeta(x_i)$ is an upper bound on the fraction of errors.


(ii) $\nu m / \sum_{i} \zeta(x_i)$ is a lower bound on the fraction of SVs.

(iii) Suppose the data (9.3) were generated iid from a distribution P(x, y) = P(x)P(y|x),
with P(y|x) continuous and the expectation of its modulus of continuity satisfying
$\lim_{\delta \to 0} \mathbf{E}\,\epsilon(\delta) = 0$. With probability 1, asymptotically, the fractions of SVs and errors equal
$\nu \cdot \left( \int \zeta(x)\, d\tilde{P}(x) \right)^{-1}$, where $\tilde{P}$ is the asymptotic distribution of SVs over x.


Figure 9.13 gives an illustration of how we can make use of parametric
insensitivity models. Using the proper model, the estimate gets much better. In
the parametric case, we used ν = 0.1 and $\zeta(x) = \sin^2((2\pi/3)x)$, which, due to
$\int \zeta(x)\, dP(x) = 1/2$, corresponds to our standard choice ν = 0.2 in ν-SVR (cf.
Proposition 9.5). Although this relies on the assumption that the SVs are uniformly
distributed, the experimental findings are consistent with the asymptotes predicted
theoretically: for m = 200, we got 0.24 and 0.19 for the fraction of SVs and errors,
respectively.
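The correspondence between ν = 0.1 in the parametric model and ν = 0.2 in standard ν-SVR can be checked numerically. The sketch below assumes that x is uniformly distributed on [−3, 3]; this interval is an assumption made for illustration (the text only states that a uniform distribution of SVs is assumed), and the helper names are hypothetical.

```python
import math


def zeta(x):
    """Shape function of the toy example: sin^2((2*pi/3) * x)."""
    return math.sin((2.0 * math.pi / 3.0) * x) ** 2


def mean_zeta_uniform(a, b, n=60000):
    """Midpoint-rule estimate of the mean of zeta under the uniform law on [a, b]."""
    h = (b - a) / n
    return sum(zeta(a + (k + 0.5) * h) for k in range(n)) / n


nu = 0.1
mean_zeta = mean_zeta_uniform(-3.0, 3.0)   # integral of zeta dP, close to 1/2
effective_nu = nu / mean_zeta              # nu * (integral of zeta dP)^(-1), close to 0.2
```

Because [−3, 3] covers a whole number of periods of sin², the average is 1/2 and the effective value is 0.2, in line with part (iii) of Proposition 9.5.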
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Table 9.3 Results for the Boston housing benchmark; top: v-SVR, bottom: e-SVR. Abbrevi- 
ation key: MSE: Mean squared errors, STD: standard deviation thereof (100 trials), Errors: 
fraction of training points outside the tube, SVs: fraction of training points which are SVs.

Empirical studies using ε-SVR have shown excellent performance on the widely
used Boston housing regression benchmark set [529]. Due to Proposition 9.3, the
only difference between ν-SVR and standard ε-SVR lies in the fact that different
parameters, ε vs. ν, have to be specified a priori. We now describe how the
results obtained on this benchmark set change with the adjustment of parameters
ε and ν. In our experiments, we kept all remaining parameters fixed, with C
and the width $2s^2$ in $k(x, x') = \exp(-\|x - x'\|^2/(2s^2))$ chosen as in [482]: we used
$2s^2 = 0.3 \cdot N$, where N = 13 is the input dimension, and $C/m = 10 \cdot 50$ (the original
value of 10 was corrected since in the present case, the maximal y-value is 50 rather
than 1). We performed 100 runs, where each time the overall set of 506 examples
was randomly split into a training set of m = 481 examples and a test set of 25
examples (cf. [529]). Table 9.3 shows that over a wide range of ν (recall that only
0 ≤ ν ≤ 1 makes sense), we obtained performances which are close to the best
performances that can be achieved using a value of ε selected a priori by looking
at the test set.⁶ Finally, note that although we did not use validation techniques
to select the optimal values for C and $2s^2$, the performances are state of the art:
Stitson et al. [529] report an MSE of 7.6 for ε-SVR using ANOVA kernels (cf. (13.13)
in Section 13.6), and 11.7 for Bagging regression trees. Table 9.3 also shows that in
this real-world application, ν can be used to control the fractions of SVs and errors.

Time series prediction is a field that often uses regression techniques. The standard
method for writing a time series prediction problem in a regression estimation
framework is to consider the time series as a dynamical system and to
try to learn an attractor. For many time series z(t), it is the case that if N ∈ ℕ
and τ > 0 are chosen appropriately, then z(t) can be predicted rather well from
$(z(t - \tau), \ldots, z(t - N\tau)) \in \mathbb{R}^N$. We can thus consider a regression problem where
the training set consists of the inputs $(z(t - \tau), \ldots, z(t - N\tau))$ and outputs z(t), for
a number of different values of t. Several characteristics of time series prediction
make the problem hard for this naive regression approach. First, time series are
often nonstationary: the regularity underlying the data changes over time. As a
consequence, training examples that are generated as described above become less
useful if they are taken from the distant past. Second, the different training
examples are not iid, which is one of the assumptions on which the statistical learning
model underlying SV regression is based.

Nevertheless, excellent results have been obtained using SVR in time series
problems [376, 351]. In [376], a record result was reported for a widely studied
benchmark dataset from the Santa Fe Institute. The study combined an ε-SVR with
a method for segmenting the data, which stem from a time series that switches
between different dynamics. SVR using ε-insensitive loss or Huber loss was found to
significantly outperform all other results on that benchmark. Another benchmark
record, on a different problem, has recently been achieved by [97]. To conclude,
we note that SV regression has also successfully been applied in black-box system
identification [216].


6. For a theoretical analysis of how to select the asymptotically optimal ν for a given noise
model, cf. Section 3.4.4.


9.7 Summary


In this chapter, we showed how to generalize the SV algorithm to regression
estimation. The generalization retains the sparsity of the solution through use
of a SV expansion, exploits the kernel trick, and uses the same regularizer as its
pattern recognition and single-class counterparts. We demonstrated how to derive
the dual problems for a variety of loss functions, and we described variants of
the algorithm. The LP-variant uses a different regularizer, which leads to sparse
expansions in terms of patterns which no longer need to lie on the edge of the
ε-tube; the ν-variant uses the same regularizer, but makes the loss function adaptive.
The latter method has the advantage that the number of outliers and SVs can
be controlled by a parameter of the algorithm, and, serendipitously, that the
parameter ε, which can be hard to set, is abolished.

Several interesting topics were omitted from this chapter, such as Density
Estimation with SVMs [591, 563]. In this case, we make use of the fact that distribution
functions are monotonically increasing, and that their values can be predicted with
variable confidence which is adjusted by selecting different values of ε in the loss
function. We also omitted the topic of Dictionaries, as introduced in the context of
wavelets by [104] to allow a large class of basis functions to be considered
simultaneously, for instance kernels with different widths. In the standard SV case, this
can only be achieved by defining new kernels as linear combinations of differently
scaled kernels. This is due to the fact that once a regularization operator is chosen,
the solution minimizing the regularized risk function has to be expanded into
the corresponding Green’s functions of $P^*P$ (Chapter 4). In these cases, a possible
way out is to resort to the LP version (Section 9.4). A final area of research left out
of this chapter is the problem of estimating the values of functions at given test
points, sometimes referred to as transduction [103].


9.1 (Product of SVR Lagrange Multipliers [561] •) Show that for ε > 0, the solution
of the SVR dual problem satisfies

$$\alpha_i \alpha_i^* = 0 \tag{9.58}$$

for all i = 1, ..., m. Prove it either directly from (9.17), or from the KKT conditions.
Show that for ε = 0, we can always find a solution which satisfies (9.58) and which is
optimal, by subtracting $\min\{\alpha_i, \alpha_i^*\}$ from both multipliers.
Give a mechanical interpretation of this result, in terms of forces on the SVs (cf.
Chapter 7).


9.2 (SV Regression with Fewer Slack Variables ••) Prove geometrically that in SV
regression, we always have $\xi_i \xi_i^* = 0$. Argue that it is therefore sufficient to just introduce
slacks $\xi_i$ and use them in both (9.9) and (9.10). Derive the dual problem and show that it
is identical to (9.17) except for a modified constraint $0 \leq \alpha_i + \alpha_i^* \leq C$. Using the result of
Problem 9.1, prove that this problem is equivalent to (9.10).

Hint: although the number of slacks is half of the original quantity, you still need both
$\alpha_i$ and $\alpha_i^*$ to deal with the constraints.


9.3 (ν-Property from the Primal Objective Function •) Try to understand the
ν-property from the primal objective function (9.31). Assume that at the point of the solution,
ε > 0, and set $(\partial/\partial\varepsilon)\,\tau(w, \varepsilon)$ equal to 0.


9.4 (One-Sided Regression ••) Consider a situation where you are seeking a flat
function that lies above all of the data points; that is, a regression that only measures errors
in one direction. Formulate an SV algorithm by starting with the linear case and later
introducing kernels. Generalize to the soft margin case, using the ν-trick. Discuss the
applicability of such an algorithm. Also discuss how this algorithm is related to ν-SVR using
different values of ν for the two sides of the tube.


9.5 (Basis Pursuit ••) Formulate a basis pursuit variant of SV regression, where,
starting from zero, SVs are added iteratively in a greedy way (cf. [577]).


9.6 (SV Regression with Hard Constraints •) Derive dual programming problems for
variants of ε-SVR and ν-SVR where all points are required to lie inside the ε-tubes (in
other words, without slack variables $\xi_i^{(*)}$). Discuss how they relate to the problems that can
be obtained from the usual duals by letting the error penalization C tend to infinity.


9.7 (Modulus of Continuity vs. Margin •) Discuss how the ε-margin (Definition 9.1)
is related (albeit not identical) to the modulus of continuity of a function: given δ > 0,
the latter measures the largest difference in function values which can be obtained using
points within a distance δ in E.


9.8 (Margin of Continuous Functions [481] •) Give an example of a continuous
function f for which $m_\varepsilon(f)$ is zero.⁷


9.9 (Margin of Uniformly Continuous Functions [481] •) Prove that $m_\varepsilon(f)$
(Definition 9.1) is positive for all ε > 0 if and only if f is uniformly continuous.⁸


9.10 (Margin of Lipschitz-Continuous Functions [481] •) Prove that if f is
Lipschitz-continuous, meaning that there exists some L > 0 such that for all x, x' ∈ E,
$\|f(x) - f(x')\|_G \leq L \cdot \|x - x'\|_E$, then $m_\varepsilon(f) \geq 2\varepsilon / L$.


9.11 (SVR as Margin Maximization [481] •) Suppose that E (Definition 9.1) is
endowed with a dot product ⟨., .⟩ (generating the norm $\|.\|_E$). Prove that for linear functions
(9.2), the margin takes the form $m_\varepsilon(f) = 2\varepsilon / \|w\|$. Argue that for fixed ε > 0, maximizing the
margin thus amounts to minimizing $\|w\|$, as done in SV regression with hard constraints.


9.12 (ε-Margin and Canonical Hyperplanes [481] •) Specialize the setting of
Problem 9.11 to the case where $X = \{x_1, \ldots, x_m\}$, and show that $m_1(f) = 2 / \|w\|$ is equal to
(twice) the margin defined for Vapnik’s canonical hyperplane (Definition 7.1). Argue
that the parameter ε is superfluous in pattern recognition.


9.13 (SVR for Vector-Valued Functions [481] ◦◦◦) Assume $E = \mathbb{R}^N$. Consider linear
functions f(x) = Wx + b, with W being an N × N matrix, and $b \in \mathbb{R}^N$. Give a lower
bound on $m_\varepsilon(f)$ in terms of a matrix norm compatible [247] with $\|.\|_E$, using the solution
of Problem 9.10.

Consider the case where the matrix norm is induced by $\|.\|_E$, which is to say there exists
a unit vector z ∈ E such that $\|Wz\|_E = \|W\|$. Give an exact expression for $m_\varepsilon(f)$.

Show that for the Hilbert-Schmidt norm $\|W\|_2 = \sqrt{\sum_{i,j=1}^{N} W_{ij}^2}$, which is compatible with
the vector norm $\|.\|_2$, the problem of minimizing $\|W\|$ subject to separate constraints for
each output dimension separates into N regression problems.

Try to conceive more interesting cases where the regression problems are coupled.⁹


9.14 (Multi-Class Problems ◦◦◦) Try to generalize $m_\varepsilon(f)$ to multi-class classification
problems, and use it to conceive useful margin maximization algorithms for this case.


7. A function f : E → G is called continuous if for every δ > 0 and x ∈ E, there exists an ε > 0
such that for all x' ∈ E satisfying $\|x - x'\|_E < \varepsilon$, we have $\|f(x) - f(x')\|_G < \delta$.

8. A function f : E → G is called uniformly continuous if for every δ > 0 there exists an ε > 0
such that for all x, x' ∈ E satisfying $\|x - x'\|_E < \varepsilon$, we have $\|f(x) - f(x')\|_G < \delta$.


9.15 (SV Regression With Overall ε-Insensitive Loss ••) Instead of (9.5), consider
the objective function

$$\frac{1}{2}\|w\|^2 + C \cdot \left| \frac{1}{m} \sum_{i=1}^{m} |y_i - f(x_i)| \right|_{\varepsilon}. \tag{9.59}$$

Note that this allows for an overall $\ell_1$ error of ε which is “for free.” Therefore, poor
performance on some of the points can, to some extent, be compensated for by high
accuracies on other points (Figure 9.14). Show that this leads to the kernelized dual
problem of maximizing

$$W(\alpha^{(*)}, \beta) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) y_i - \beta\varepsilon - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j), \tag{9.60}$$

subject to

$$\sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0, \quad 0 \leq \alpha_i^{(*)} \leq \frac{\beta}{m}, \quad 0 \leq \beta \leq C. \tag{9.61}$$

Hint: Introduce slacks $\eta_i^{(*)} \geq 0$ which measure the deviation at each point, and put an
ε-insensitive constraint on their sum. Introduce another slack ξ ≥ 0 for allowing violations
of that constraint, and penalize Cξ in the primal objective function.


9.16 (Computation of ε and b in ν-SVR •) Suppose i and j are the indices of two
points such that $0 < \alpha_i < C/m$ and $0 < \alpha_j^* < C/m$ (“in-bound SVs”). Compute ε and
b by exploiting that the KKT conditions imply (9.32) and (9.33) become equalities with
$\xi_i = 0$ and $\xi_j^* = 0$.


9.17 (Parametric ν-SVR Dual ••) Derive the dual optimization problem of ν-SVR with
parametric loss models (Section 9.5).


9.18 (Parametric ν-Property •••) Prove Proposition 9.5.


9.19 (Heteroscedastic Noise •••) Combine ν-SVR using parametric tubes with a
variance (e.g., [488]) or quantile estimator (Section 9.3) to construct a SVR algorithm that can
deal with heteroscedastic noise.


9. Cf. also Chapter 4, where it is shown that under certain invariance conditions, the
regularizer has to act on the output dimensions separately and identically (that is, in a scalar
fashion). In particular it turns out that under the assumption of quadratic homogeneity and
permutation symmetry, the Hilbert-Schmidt norm is the only admissible norm.

10. This problem builds on joint work with Bob Williamson.

Figure 9.14 Solid line: SVR with overall ε-insensitive loss, dashed lines: standard ε-SVR,
with ±ε tube. m = 25 data points were generated as follows: x-values drawn from uniform
distribution over [−3, 3], y-values computed from $y = \sin(\pi x)/(\pi x)$; to create two outliers,
the y values of two random points were changed by ±1. The SV machine parameters were
ε = 0.1, C = 10000. For this value of C, which is essentially the hard margin case, the original
SVM does a poor job as it tries to follow the outliers. The alternative approach, with a value
of ε adjusted such that the overall $\ell_1$ error is the same as before, does better, as it is willing
to “spend” a large part of its ε on the outliers.


9.20 (SV Regression using ν and ε ◦◦◦) Try to come up with a formulation of SV
regression which uses ν and ε rather than C and ε (ε-SVR) or C and ν (ν-SVR).


9.21 (ν-SV Regression with Huber’s Loss Function ◦◦◦) Try to generalize ν-SV
regression to use loss functions other than the ε-insensitive one, such as the Huber loss,
which is quadratic inside the ε-tube and linear outside (cf. Chapter 3).

Study the relationship between ν and the breakdown point of the estimator [251].


9.22 (Relationship to “Almost Exact Interpolation” ◦◦◦) Discuss the relationship
between SVR and Powell’s algorithm for interpolation with thin plate spline kernels [422].
Devise a variant of Powell’s algorithm that uses the ε-insensitive loss.
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Overview 


Prerequisites 


Implementation 


This chapter gives an overview of methods for solving the optimization problems 
specific to Support Vector Machines. Algorithms specific to other settings, such 
as Kernel PCA and Kernel Feature Analysis (Chapter 14), Regularized Principal 
Manifolds (Chapter 17), estimation of the support of a distribution (Chapter 8), 
Kernel Discriminant Analysis (Chapter 15), or Relevance Vector Machines (Chap- 
ter 16) can be found in the corresponding chapters. The large amount of code and 
number of publications available, and the importance of the topic, warrants this 
separate chapter on Support Vector implementations. Moreover, many of the tech- 
niques presented here are prototypical of the solutions of optimization problems 
in other chapters of this book and can be easily adapted to particular settings. 

Due to the sheer size of the optimization problems arising in the SV setting, we
must pay special attention to how these problems can be solved efficiently. In
Section 10.1 we begin with a description of strategies which can benefit almost all
currently available optimization methods, such as universal stopping criteria, caching
strategies and restarting rules. Section 10.2 details low rank approximations of the
kernel matrix, $K \in \mathbb{R}^{m \times m}$. These methods allow the replacement of K by the outer
product $ZZ^\top$ of a “tall and skinny” matrix $Z \in \mathbb{R}^{m \times n}$ where $n \ll m$. The latter can
be used directly in algorithms whose speed improves with linear Support Vector
Machines (SMO, Interior Point codes, Lagrangian SVM, and Newton’s method).
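One reason a factorization $K \approx ZZ^\top$ helps is that products with K never require the m × m matrix itself. The following is a minimal pure-Python sketch of that idea, not one of the methods of Section 10.2; the helper names and the tiny matrix Z are made up for illustration.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]


def lowrank_matvec(Z, v):
    """Compute (Z Z^T) v as Z (Z^T v), in O(m n) instead of O(m^2) operations."""
    return matvec(Z, matvec(transpose(Z), v))


# Hypothetical factor with m = 4 rows and n = 2 columns.
Z = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0], [2.0, -0.5]]
v = [1.0, -2.0, 0.5, 3.0]

# Reference computation that forms the full m x m matrix explicitly.
K = [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]
dense = matvec(K, v)
fast = lowrank_matvec(Z, v)
```

Both routes give the same vector; only the second avoids storing and touching all m² kernel entries.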

Subsequently we present four classes of algorithms: interior point codes, subset
selection, sequential minimization, and iterative methods. Interior Point methods
are explained in Section 10.3. They are some of the most reliable methods for
moderate problem sizes, yet their implementation is not trivial. Subset selection
methods, as in Section 10.4, act as meta-algorithms on top of a basic optimization
algorithm by carving out sets of variables on which the actual optimization
takes place. Sequential Minimal Optimization, presented in Section 10.5, is a special
case thereof. Due to the choice of only two variables at a time, the restricted
optimization problem can be solved analytically, which obviates the need for an
underlying base optimizer. Finally, iterative methods such as online learning,
gradient descent, and Lagrangian Support Vector Machines are described in Section
10.6. Figure 10.1 gives a rough overview describing under which conditions which
optimization algorithm is recommended.

This chapter is intended for readers interested in implementing an SVM themselves.
Consequently, we assume that the reader is familiar with the basic concepts
of both optimization (Chapter 6) and SV estimation (Chapters 1, 7, and 9).

Figure 10.1 A decision tree for selecting suitable optimization algorithms. Most kernel
learning problems will be batch ones (for the online setting see Section 10.6.3). For small
and medium sized problems, that is, as long as the kernel matrix fits into main memory,
an interior point code is recommended, since it produces optima of very high quality. For
larger problems approximations are required, which leads to sparse greedy approximation
schemes or other reduced set methods. An alternative strategy, which is particularly
attractive if the size of the final kernel expansion is not important, can be found in subset selection
methods such as chunking and SMO.
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Notation Issues 


Knowledge of Lagrangians, duality, and optimality conditions (Section 6.3.1) is 
required to understand both the stopping rules in more detail and the section on 
interior point methods. Essentially, this chapter builds on the general exposition 
of Section 6.4. In addition, some of the methods from subset approximation rely 
on the randomized optimization concepts of Section 6.5. A basic understanding of 
fundamental concepts in numerical analysis, for example the notion of a Cholesky 
decomposition, will also prove useful, yet it is not an essential requirement. See 
textbooks [530, 247, 207, 46] for more on this topic. 

Note that in this chapter we will alternate between the ν and the C-formulation
(see Section 7.5), for different algorithms. This is due to the fact that some
algorithms are not capable of treating the ν-formulation efficiently. Furthermore, the
C-formulation differs from the one of minimizing the regularized risk functional
insofar as we minimize

$$C \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \Omega[f] \tag{10.1}$$

rather than

$$\frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \lambda \Omega[f]. \tag{10.2}$$

We can transform one setting into the other via $C = \frac{1}{\lambda m}$. The C notation is used in
order to be consistent with the published literature on SV optimization algorithms.
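The conversion between the two settings can be checked on toy numbers. The sketch below reads (10.1) as $C \sum_i c_i + \Omega[f]$ and (10.2) as $\frac{1}{m}\sum_i c_i + \lambda\Omega[f]$; the loss values, the value of Ω and the function names are made-up assumptions for illustration only.

```python
def c_from_lambda(lam, m):
    """Convert the regularization constant lambda into C = 1 / (lambda * m)."""
    return 1.0 / (lam * m)


def objective_c(losses, omega, C):
    """C-formulation: C * sum of losses + Omega[f]."""
    return C * sum(losses) + omega


def objective_lambda(losses, omega, lam):
    """Regularized risk: mean loss + lambda * Omega[f]."""
    return sum(losses) / len(losses) + lam * omega


# Hypothetical per-example losses and regularizer value for some fixed f.
losses = [0.0, 0.5, 1.25, 0.0]
omega = 2.0
lam = 0.05

C = c_from_lambda(lam, len(losses))
lhs = objective_lambda(losses, omega, lam)
rhs = lam * objective_c(losses, omega, C)   # the two objectives differ only by the factor lambda
```

Since the two objectives differ by the positive constant factor λ for every f, they share the same minimizer.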


10.1 Tricks of the Trade 


We start with an overview of useful “tricks”: modifications from which almost any
algorithm will benefit. Examples are techniques which speed up training
significantly or which determine when the algorithm should be stopped.

This section is intended both for the practitioner who would like to improve an
existing SV algorithm and also for readers new to SV optimization, since most of
the tools developed prove useful in the optimization equations later. We present
three tricks: a practical stopping criterion, a restart method, and an overview of
caching strategies.


10.1.1 Stopping Criterion 


It would be ideal if we always were able to obtain the solution by optimization
methods (e.g., from Section 6.4). Unfortunately, due to the size of the problem, this
is often not possible and we must limit ourselves to approximating the solution by
an iterative strategy.

Several stopping criteria have been suggested regarding when to stop training
a Support Vector Machine. Some of these focus mainly on the precision of the
Lagrange multipliers $\alpha_i$ [266, 409, 494], whereas others [514, 459] use the proximity
of the values of the primal and dual objective functions. Yet others stop simply
when no further improvement is made [398].

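Before we develop a stopping criterion recall that ultimately we want to find a
solution $f(x) = \langle w, \Phi(x) \rangle + b$ that minimizes one of the regularized risk functionals
described below. In the case of classification,

$$\begin{array}{llcl}
\text{minimize} & C \sum_{i=1}^{m} c(\xi_i) + \frac{1}{2}\|w\|^2 & & \sum_{i=1}^{m} c(\xi_i) + \frac{1}{2}\|w\|^2 - \nu\rho \\
\text{subject to} & y_i f(x_i) \geq 1 - \xi_i & \text{or} & y_i f(x_i) \geq \rho - \xi_i \\
 & \xi_i \geq 0 & & \xi_i \geq 0,\ \rho \in \mathbb{R}
\end{array} \tag{10.3}$$

(the right half of the equations describes the analogous setting with the
ν-parameter); similarly, for regression,

$$\begin{array}{llcl}
\text{minimize} & C \sum_{i=1}^{m} \left( c(\xi_i) + c(\xi_i^*) \right) + \frac{1}{2}\|w\|^2 & & \sum_{i=1}^{m} \left( c(\xi_i) + c(\xi_i^*) \right) + \frac{1}{2}\|w\|^2 + \nu\varepsilon \\
\text{subject to} & f(x_i) \geq y_i - \varepsilon - \xi_i & \text{or} & f(x_i) \geq y_i - \varepsilon - \xi_i \\
 & f(x_i) \leq y_i + \varepsilon + \xi_i^* & & f(x_i) \leq y_i + \varepsilon + \xi_i^* \\
 & \xi_i, \xi_i^* \geq 0 & & \xi_i, \xi_i^* \geq 0,\ \varepsilon \in \mathbb{R}
\end{array} \tag{10.4}$$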


This means that ultimately not the Lagrange multipliers $\alpha_i$ but rather w, or only
the value of the primal objective function, matters. Thus, algorithms [266, 290,
291, 398] which rely on the assumption that proximity to the optimal parameters
will ensure a good solution may not be using an optimal stopping criterion. In
particular, such a criterion may sometimes be overly conservative, especially if the
influence of individual parameters on the final estimate is negligible. For instance,
assume that we have a linear dependency in the dual objective function. Then
there exists a linear subspace of parameters which would all be suitable solutions,
leading to identical vectors w. Therefore, convergence within this subspace may
not occur and, even if it does, it would not be relevant to the quality of the solution.
What we would prefer to have is a way of bounding the distance between the
objective function at the current solution f and at $f_{\mathrm{opt}}$. Since (10.3) and (10.4) are
both constrained optimization problems we may make use of Theorem 6.27 and
lower bound the values of (10.3) and (10.4) via the KKT Gap. The advantage is
that we do not have to know the optimal value in order to assess the quality of the
approximate solution. The following Proposition formalizes this connection.


Proposition 10.1 (KKT-Gap for Support Vector Machines) Denote by f the (possibly
not optimal) estimate obtained during a minimizing procedure of the optimization
problem (10.3) or (10.4) derived from the regularized risk functional $R_{\mathrm{reg}}[f]$. Further,
denote by $f_{\mathrm{opt}}$ the minimizer of $R_{\mathrm{reg}}[f]$. Then under the condition of dual feasible variables
(namely that the equality and box constraints are satisfied), the following inequality holds:

$$R_{\mathrm{reg}}[f] \geq R_{\mathrm{reg}}[f_{\mathrm{opt}}] \geq R_{\mathrm{reg}}[f] - \frac{1}{Cm} \mathrm{Gap}[f] \tag{10.5}$$

where Gap[f] is defined as follows:

1. In the case of classification with C-parametrization

$$\begin{aligned}
\mathrm{Gap}[f] &= \sum_{j=1}^{m} \max(0, 1 - y_j f(x_j))\, C\, \partial_\xi c(\max(0, 1 - y_j f(x_j))) + \alpha_j (y_j f(x_j) - 1) \\
&= \sum_{j=1}^{m} C \max(0, 1 - y_j f(x_j)) + \alpha_j (y_j f(x_j) - 1)
\end{aligned} \tag{10.6}$$

2. For classification in the ν-formulation

$$\mathrm{Gap}[f] = \sum_{j=1}^{m} \max(0, \rho - y_j f(x_j)) + \alpha_j (y_j f(x_j) - \rho) \tag{10.7}$$

3. For ε-regression, where $\xi_j = \max(0, y_j - f(x_j) - \varepsilon)$ and $\xi_j^* = \max(0, f(x_j) - y_j - \varepsilon)$,

$$\begin{aligned}
\mathrm{Gap}[f] &= \sum_{j=1}^{m} \xi_j C \partial_\xi \tilde{c}(\xi_j) + \xi_j^* C \partial_\xi \tilde{c}(\xi_j^*) + \alpha_j (\varepsilon + f(x_j) - y_j) + \alpha_j^* (\varepsilon - f(x_j) + y_j) \\
&= \sum_{j=1}^{m} C \xi_j + C \xi_j^* + \alpha_j (\varepsilon + f(x_j) - y_j) + \alpha_j^* (\varepsilon - f(x_j) + y_j)
\end{aligned} \tag{10.8}$$

4. In the ν-formulation the gap is identical to (10.8), with the only difference being that
C = 1 and that ε is a variable of the optimization problem.


Here ρ = 1 is a constant in the C-formulation and C = 1 is one in the ν-formulation.
For regression we denote by $\tilde{c}(\xi)$ the nonzero branch of $c(x_i, y_i, f(x_i))$ which for the
ε-insensitive regression setting becomes $\tilde{c}(\xi) = \xi$. Finally, note that in the ν-regression
formulation, ε is a variable.
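The gap expressions are cheap to evaluate once the function values $f(x_j)$ are available. The following minimal sketch implements the second lines of (10.6) and (10.8), that is, the soft-margin and ε-insensitive cases with $\tilde{c}(\xi) = \xi$; the function names and the toy numbers are illustrative assumptions, and the multipliers are assumed to satisfy the box constraints $0 \leq \alpha_j \leq C$.

```python
def gap_classification(f_vals, labels, alphas, C):
    """KKT gap (10.6) for soft-margin classification, one pass over the data."""
    gap = 0.0
    for f, y, a in zip(f_vals, labels, alphas):
        margin = y * f
        gap += C * max(0.0, 1.0 - margin) + a * (margin - 1.0)
    return gap


def gap_regression(f_vals, targets, alphas, alphas_star, C, eps):
    """KKT gap (10.8) for eps-insensitive regression, one pass over the data."""
    gap = 0.0
    for f, y, a, a_s in zip(f_vals, targets, alphas, alphas_star):
        xi = max(0.0, y - f - eps)        # slack for points above the tube
        xi_star = max(0.0, f - y - eps)   # slack for points below the tube
        gap += C * (xi + xi_star) + a * (eps + f - y) + a_s * (eps - f + y)
    return gap


# Toy classification values: margins 1.0, 2.0 and 0.5 with C = 1.
f_vals, labels = [1.0, 2.0, 0.5], [1, 1, 1]
gap_kkt = gap_classification(f_vals, labels, [0.3, 0.0, 1.0], 1.0)   # KKT-consistent multipliers
gap_sub = gap_classification(f_vals, labels, [0.3, 0.2, 0.4], 1.0)   # suboptimal multipliers

# Toy regression values: one point inside the tube, one outside with its multiplier at C.
gap_reg = gap_regression([0.0, 0.0], [0.05, 0.5], [0.0, 1.0], [0.0, 0.0], 1.0, 0.1)
```

Each summand is nonnegative under the box constraints, and the total vanishes exactly when every point satisfies its KKT condition, which is what the first and third examples show.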


Such a lower bound on the minimum of the objective function has the added
benefit that it can be used to devise a stopping criterion. We simply use the same
strategy as in interior point codes (Section 6.4.4) and stop when the relative size of
the gap is below some threshold ε, that is, if

$$\frac{1}{Cm}\mathrm{Gap}[f] \leq \epsilon\, \frac{\left| R_{\mathrm{reg}}[f] \right| + \left| R_{\mathrm{reg}}[f] - \frac{1}{Cm}\mathrm{Gap}[f] \right|}{2}. \tag{10.9}$$

Proof All we must do is apply Theorem 6.27 by rewriting (6.60) in terms of the
currently used expressions and subsequently find good values for the variables
that have not been specified explicitly. This will show that the size of the KKT-gap
is given by (10.6), (10.7), and (10.8).

The first thing to note is that free variables do not contribute directly to the
size of the KKT gap provided the corresponding equality constraints in the dual
optimization problem are satisfied. Therefore it is sufficient to give the proof only
for the C-parametrization; the ν-parametrization simply uses an additional
equality constraint due to an extra free variable.

Rewriting (6.60) in terms of the SV optimization problem means that now w and
ξ are the variables of the optimization problem and $x_i$, $y_i$ are merely constants. We
review the constraints,

$$0 \geq \rho - \xi_i - y_i f(x_i) \ \text{ and } \ 0 \geq -\xi_i \quad \text{(classification)} \tag{10.10}$$

$$\begin{array}{l}
0 \geq -f(x_i) + y_i - \varepsilon - \xi_i \ \text{ and } \ 0 \geq -\xi_i \\
0 \geq +f(x_i) - y_i - \varepsilon - \xi_i^* \ \text{ and } \ 0 \geq -\xi_i^*
\end{array} \quad \text{(regression)} \tag{10.11}$$

Since the optimizations take place in dual space (the space of the corresponding
Lagrange multipliers $\alpha_i$, $\eta_i$), we can use the latter directly in our bound. What
remains to be done is to find $\xi_i$, since f is determined by $\alpha_i$. The constraint imposed
on $\xi_i$ is that f and the corresponding $\alpha_i$ satisfy the dual feasibility constraints
(6.53). We obtain

$$\begin{aligned}
\partial_{\xi_j} L(w, \xi, \alpha, \eta) &= \partial_{\xi_j} \left( C \sum_{i=1}^{m} c(\xi_i) + \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - \xi_i - y_i f(x_i) \right) - \eta_i \xi_i \right) \\
&= C \partial_\xi c(\xi_j) - \alpha_j - \eta_j = 0
\end{aligned} \tag{10.12}$$

for classification. Now we have to choose $\eta_j$, $\xi_j$ such that $C \partial_\xi c(\xi_j) = \alpha_j + \eta_j$ is
satisfied. At the same time we would like to obtain good lower bounds. Hence we
must choose the parameters in such a way that the KKT gap (6.53) is minimal, that
is, all the terms

$$\begin{aligned}
\mathrm{KKT}_j &:= \eta_j \xi_j + \alpha_j \left( y_j f(x_j) - 1 + \xi_j \right) && (10.13) \\
&= \left( C \partial_\xi c(\xi_j) - \alpha_j \right) \xi_j + \alpha_j \left( y_j f(x_j) - 1 + \xi_j \right) && (10.14) \\
&= C \xi_j \partial_\xi c(\xi_j) + \alpha_j \left( y_j f(x_j) - 1 \right) && (10.15)
\end{aligned}$$

are minimized. The second term is independent of $\xi_j$ and the first term is
monotonically increasing with $\xi_j$ (since c is convex). The smallest value for $\xi_j$ is given
by (10.10). Combining these two constraints gives

$$\xi_j = \max(0, 1 - y_j f(x_j)). \tag{10.16}$$

Together (10.16), (10.13), and (6.53) prove the bound for classification. Finally,
substituting $c(\xi_j) = \xi_j$ (the soft-margin loss function [40, 111]) yields (10.6). For
regression we proceed analogously. The optimality criteria for $\xi_j$, $\xi_j^*$ and $\eta_j$, $\eta_j^*$ are

$$C \partial_\xi \tilde{c}(\xi_j) - \alpha_j - \eta_j = 0 \ \text{ and } \ C \partial_{\xi^*} \tilde{c}(\xi_j^*) - \alpha_j^* - \eta_j^* = 0. \tag{10.17}$$

In addition, from (6.53), we obtain

$$\begin{aligned}
\mathrm{KKT}_j &:= \eta_j \xi_j + \eta_j^* \xi_j^* + \alpha_j \left( \varepsilon + \xi_j + f(x_j) - y_j \right) + \alpha_j^* \left( \varepsilon + \xi_j^* - f(x_j) + y_j \right) && (10.18) \\
&= \xi_j C \partial_\xi \tilde{c}(\xi_j) + \xi_j^* C \partial_\xi \tilde{c}(\xi_j^*) + \alpha_j \left( \varepsilon + f(x_j) - y_j \right) + \alpha_j^* \left( \varepsilon + y_j - f(x_j) \right). && (10.19)
\end{aligned}$$

By similar reasoning as before we can see that the optimal $\xi_j$, $\xi_j^*$ are given by

$$\xi_j = \max(0, y_j - f(x_j) - \varepsilon) \ \text{ and } \ \xi_j^* = \max(0, f(x_j) - y_j - \varepsilon) \tag{10.20}$$

which completes the proof. ∎


The advantage of (10.6) and (10.8) is that they can be computed in O(m) time
provided that the function values $f(x_i)$ are already known. This means that
convergence checking can be done for almost no additional cost with respect to the
overall cost of the training algorithm.

An important thing to remember is that for algorithms which minimize only the
dual or only the primal objective function the size of the gap may grow between
optimization steps. This is because an improvement in the primal objective does
not necessarily imply an improvement in the dual and vice versa. One way to
overcome this problem (besides a redesign of the optimization algorithm, which
may be out of the question in most cases) is to note that it immediately follows from
(10.5) that

$$\min_i R_{\mathrm{reg}}[f_i] \geq R_{\mathrm{reg}}[f_{\mathrm{opt}}] \geq \max_i \left[ R_{\mathrm{reg}}[f_i] - \frac{1}{Cm}\mathrm{Gap}[f_i] \right] \tag{10.21}$$

where $f_i$ is the estimate at iteration i. In many algorithms such as SMO, where
the dual gap can fluctuate considerably, this leads to much improved bounds
on $R_{\mathrm{reg}}[f_{\mathrm{opt}}]$ compared to (10.5). In experiments a gap-optimal choice of b led to
decreased, but still existing, fluctuations. See also [291, 494] for details on how such
an optimal value of b can be found.
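A running version of the bound (10.21) only needs two scalars per iteration. The sketch below keeps the smallest primal value and the largest lower bound seen so far and then applies a relative-gap test; the averaging form of that test, the `scale` argument (standing in for whatever factor puts the gap into the units of the primal value) and the toy history are assumptions for illustration.

```python
def track_bounds(history, scale=1.0):
    """Running bounds in the spirit of (10.21).

    history: iterable of (primal_value, gap) pairs, one per iteration.
    scale:   factor that puts the gap into the units of the primal value.
    Returns a list of (best_upper, best_lower) pairs, one per iteration.
    """
    best_upper = float("inf")
    best_lower = float("-inf")
    bounds = []
    for primal, gap in history:
        best_upper = min(best_upper, primal)
        best_lower = max(best_lower, primal - scale * gap)
        bounds.append((best_upper, best_lower))
    return bounds


def should_stop(upper, lower, tol):
    """Relative-gap test: stop when the bracket is small compared to its average magnitude."""
    return (upper - lower) <= tol * 0.5 * (abs(upper) + abs(lower))


# Hypothetical optimization trace in which the gap grows again in the last step.
history = [(10.0, 6.0), (9.0, 7.0), (8.5, 1.0), (8.4, 2.0)]
bounds = track_bounds(history)
final_upper, final_lower = bounds[-1]
```

In the toy trace the last iterate alone would give the bracket [6.4, 8.4], whereas the running bounds give [7.5, 8.4], which illustrates why (10.21) can be noticeably tighter than (10.5) when the gap fluctuates.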


10.1.2 Restarting with Different Parameters 


Quite often we must train a Support Vector Machine for more than one specific pa- 
rameter setting. In such cases it is beneficial to re-use the solution obtained for one 
specific parameter setting in finding the remaining ones. In particular, situations 
involving different choices of regularization parameter or different kernel widths 
benefit significantly from this parameter re-use as opposed to starting from f = 0. 
Let us analyze the situation in more detail. 


Restarting for C: Denote by $f_C$ the minimizer of the regularized risk functional
(slightly modified to account for $C$ rather than for $\lambda$)

$$R_{\mathrm{reg}}[f, C] := C \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \Omega[f]. \tag{10.22}$$

By construction

$$R_{\mathrm{reg}}[f_C, C'] \ge R_{\mathrm{reg}}[f_{C'}, C'] \ge R_{\mathrm{reg}}[f_{C'}, C] \ge R_{\mathrm{reg}}[f_C, C] \quad\text{for all } C' > C. \tag{10.23}$$

The first inequality follows from the fact that $f_{C'}$ is the minimizer of $R_{\mathrm{reg}}[f, C']$.
The second inequality is a direct consequence of $C' > C$, and, finally, the third
inequality is due to the optimality of $f_C$. Additionally, we conclude from (10.22)
that

$$R_{\mathrm{reg}}[f_{C'}, C'] \le R_{\mathrm{reg}}[f_C, C] + (C' - C)\, m\, R_{\mathrm{emp}}[f_C] \le \frac{C'}{C}\, R_{\mathrm{reg}}[f_C, C] \tag{10.24}$$

and thus

$$\frac{C}{C'}\, R_{\mathrm{reg}}[f_{C'}, C'] \le R_{\mathrm{reg}}[f_C, C] \le R_{\mathrm{reg}}[f_{C'}, C']. \tag{10.25}$$


In other words, changes in the regularized risk functional $R_{\mathrm{reg}}[f, C]$ are bounded
by the changes in $C$. This has two implications. First, it does not make sense to use
an overly fine grid in $C$ when looking for minima of $R_{\mathrm{reg}}[f]$. Second, (10.23) shows
that for changes of $C$ into $C'$ which are not too large it is beneficial to re-use $f_C$
rather than to restart from $f = 0$.

In practice it is often advantageous to start with a solution for large $C$ and keep
on increasing the regularization (decrease $C$). This has the effect that initially most
of the $\alpha_i, \alpha_i^*$ will be unconstrained. Subsequently, all the variables that have been
found to be constrained will typically tend to stay that way. This can speed up the
training phase by up to a factor of 20, since the algorithm can focus
on unconstrained variables only. See [502], among others, for experimental details.
In order to satisfy the dual constraints it is convenient to rescale the Lagrange
multipliers in accordance with the change in $C$. This means we rescale each coefficient
by $\alpha_i' = \frac{C'}{C}\alpha_i$, where $\alpha_i'$ are the start values when training with $C'$ instead of $C$. Such
a modification leaves the summation constraints intact. See also [288] for further
details on how to adjust parameters for changed values of $C$.
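The rescaling step can be sketched in a few lines. The following is a minimal illustration, assuming the classification dual with box constraints $0 \le \alpha_i \le C$ and the summation constraint $\sum_i \alpha_i y_i = 0$; the function name `rescale_multipliers` and the toy values are hypothetical and not taken from any particular solver.

```python
import numpy as np

def rescale_multipliers(alpha, C_old, C_new):
    """Scale dual variables from a solution at C_old to start values at C_new."""
    return (C_new / C_old) * np.asarray(alpha, dtype=float)

# Toy dual-feasible point: 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([10.0, 4.0, 0.0, 6.0])      # assumed solution for C = 10
alpha_start = rescale_multipliers(alpha, C_old=10.0, C_new=2.0)

print(alpha_start)               # start values for C' = 2
print(float(alpha_start @ y))    # the summation constraint is preserved
```

Because every coefficient is multiplied by the same factor, variables at the bounds $0$ and $C$ are mapped to the bounds $0$ and $C'$, so the set of constrained variables carries over to the new problem.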


Restarting for σ: Gaussian RBF kernels (2.68) are popular choices for Support
Vector Machines. Here one problem is to adapt the width $\sigma$ of the kernel suitably.
In [123] it is shown that for the soft-margin loss function the minimizer (and its
RKHS norm) of the regularized risk is a smooth function of $\sigma$.

More generally, if the regularization operator only changes smoothly, we can
employ similar reasoning to that above. Note that in the current case not only
the regularizer but also $f$ itself changes (since we only have a representation of
$f$ via $\alpha_i$). Yet, unless the change in $\sigma$ is too large, the value of the regularized risk
functional will be smaller than for the default guess $f = 0$, hence it is advantageous
to re-use the old parameters $\alpha_i, b$.


10.1.3 Caching 


A simple and useful trick is to store parts of the kernel matrix $K_{ij}$, or also $f(x_i)$, for
future use when storage of the whole kernel matrix $K$ is impossible due to memory
constraints. We have to distinguish between different techniques.


Row Cache: This is one of the easiest techniques to implement. Usually we allocate
space for as many rows of $K$ (each of length $m$) as memory permits. Simple caching
strategies such as LRU (least recently used: keep only the most recently used rows of
$K$ in the cache and replace the oldest rows first) can provide an 80% to 90% hit rate
(the fraction of elements found in the cache) with a cache size of 10% of the original
matrix. See, for example, [266, 309, 494, 134] for details. Row cache strategies work
best for sequential update and subset selection methods such as SMO (see Section
10.5). Moreover, we can expect significant improvement via row cache strategies if
the number of non-bound Lagrange multipliers $\alpha_i \in (0, C)$ is small, since these are
the parameters revisited many times.
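A row cache with LRU replacement can be sketched as follows. This is only an illustration under simple assumptions: the class name `KernelRowCache`, the helper `rbf`, and the access pattern are hypothetical, and a production implementation would store the rows in one preallocated block of memory rather than in a dictionary.

```python
from collections import OrderedDict
import numpy as np

class KernelRowCache:
    """LRU cache holding at most `max_rows` rows of the kernel matrix."""

    def __init__(self, X, kernel, max_rows):
        self.X, self.kernel, self.max_rows = X, kernel, max_rows
        self.rows = OrderedDict()          # row index -> cached row of K
        self.hits = self.misses = 0

    def row(self, i):
        if i in self.rows:
            self.hits += 1
            self.rows.move_to_end(i)       # mark as most recently used
            return self.rows[i]
        self.misses += 1
        if len(self.rows) >= self.max_rows:
            self.rows.popitem(last=False)  # evict the least recently used row
        self.rows[i] = self.kernel(self.X[i], self.X)
        return self.rows[i]

def rbf(x, X, sigma=1.0):
    """Gaussian RBF kernel between one pattern x and all rows of X."""
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
cache = KernelRowCache(X, rbf, max_rows=5)
for i in [0, 1, 0, 2, 0, 1, 3, 4, 5, 0]:   # non-bound indices are revisited often
    cache.row(i)
print(cache.hits, cache.misses)
```

The access sequence mimics the situation described above: a few indices are requested repeatedly, so they stay in the cache while rarely used rows are evicted.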


Element Cache: A more fine-grained caching strategy would store individual elements
of $K$ rather than entire rows or columns. This has the additional advantage
that for sparse solutions, where many $\alpha_i = 0$, possibly all relevant entries $K_{ij}$ can be
cached [172]. The downside is that the organization of the cache, as, for example,
a list, is considerably more complex and that this overhead¹ may easily outweigh
the improvement in the hit rate in terms of kernel entries.


Function Cache: If only very few Lagrange multipliers $\alpha_i$ change per iteration
step, we may update $f(x_j)$ (which is useful for the KKT stopping criteria described
in Section 10.1.1) cheaply.

Assume that a set $\alpha_1, \ldots, \alpha_n$ of Lagrange multipliers is changed (without loss of
generality we pick the first $n$). Then $f(x_j)$ can be rewritten as

$$f^{\mathrm{new}}(x_j) = \sum_{i=1}^{m} \alpha_i^{\mathrm{new}} k(x_i, x_j) + b = f^{\mathrm{old}}(x_j) + \sum_{i=1}^{n} \left(\alpha_i^{\mathrm{new}} - \alpha_i^{\mathrm{old}}\right) k(x_i, x_j). \tag{10.26}$$

Note that in order to prevent numerical instabilities building up too quickly, it is
advisable to update (10.26) in the way displayed rather than computing

$$f^{\mathrm{old}}(x_j) + \sum_{i=1}^{n} \alpha_i^{\mathrm{new}} k(x_i, x_j) - \sum_{i=1}^{n} \alpha_i^{\mathrm{old}} k(x_i, x_j).$$

After several updates, depending on the machine precision, it is best to recalculate
$f(x_j)$ from scratch.
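A function cache update along the lines of (10.26) might look as follows. The sketch assumes that the rows of the kernel matrix belonging to the changed multipliers are available (for instance from a row cache); the helper name `update_function_cache` and the random test data are illustrative only.

```python
import numpy as np

def update_function_cache(f_cache, K_rows, alpha_old, alpha_new):
    """Apply (10.26): add sum_i (alpha_new_i - alpha_old_i) k(x_i, x_j) to f(x_j).

    K_rows[i, j] = k(x_i, x_j) for the n changed multipliers i and all m patterns j.
    """
    delta = np.asarray(alpha_new) - np.asarray(alpha_old)   # form the difference first
    return f_cache + delta @ K_rows

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
alpha, b = rng.normal(size=20), 0.3
f_cache = alpha @ K + b                      # f(x_j) for all j, computed once

changed = [2, 7]                             # indices of the changed multipliers
alpha_new = alpha.copy()
alpha_new[changed] += np.array([0.5, -0.5])
f_cache = update_function_cache(f_cache, K[changed], alpha[changed], alpha_new[changed])
```

The update touches only $n$ rows of $K$ and costs $O(mn)$ operations, compared with $O(m^2)$ for recomputing all function values.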


10.1.4 Shrinking the Training Set 


While it may not always be possible to carry out optimization on a subset of 
patterns right from the beginning, we may, as the optimization progresses, drop 
the patterns for which the corresponding Lagrange multipliers will end up being 
constrained to their upper or lower limits. 

If we discard those patterns $x_i$ with $\alpha_i = 0$ this amounts to effectively reducing
the training set (see also Section 10.4.2 for details and equations). This is in
agreement with the idea that only the Support Vectors will influence the decision
functions. There exist several implementations which use such subset selection
heuristics to improve training time [559, 111, 561, 463, 409, 172].

We describe another example of subset methods in Section 10.3 where we ap- 
ply subset selection to interior point methods. In a nutshell the idea is that with 
decreasing KKT terms (10.13) either the constraint must be satisfied and the corre- 
sponding Lagrange multiplier vanishes or the constraint must be met exactly. 

Finally, assigning sticky-flags (cf. [85]) to variables at the boundaries also im- 
proves optimization. This means that once a variable is determined to be bound 
constrained it will remain fixed for the next few iterations. This heuristic avoids 
oscillatory behavior during the solution process. 


1. Modern microprocessor architectures are largely limited by their memory bandwidth
which means that an increased hardware cache miss rate due to non-contiguous storage
of the matrix elements may affect performance quite dramatically. Furthermore such a
strategy will also lead to paging operations of the operating system.


10.2 Sparse Greedy Matrix Approximation

In the following we describe what can be thought of as another useful trick. Its
practical significance warrants a more detailed description, however. The reader
only interested in the basic optimization algorithms may skip this section.

Sparse greedy approximation techniques are based on the observation that typically
the matrix $K$ has many small eigenvalues which could easily be removed
without sacrificing too much precision.² This suggests that we could possibly find
a subset of basis functions $k(x_i, x)$ which would minimize the regularized risk
functional $R_{\mathrm{reg}}[f]$ almost as well as the full expansion required by the Representer
Theorem (Th. 4.2). This topic will be discussed in more detail in Chapter 18.


10.2.1 Sparse Approximations 


In one way or another, most kernel algorithms have to deal with the kernel matrix
$K$ which is of size $m \times m$. Unfortunately the cost of computing or of storing the
latter increases with $O(m^2)$ and the cost of evaluating the solution (sometimes
referred to as the prediction) increases with $O(m)$. Hence, one idea is to pick some
functions $k(x_1, \cdot), \ldots, k(x_n, \cdot)$ (for notational convenience we chose the first $n$, but
this assumption will be dropped at a later stage) with $n < m$ such that we can
approximate every single $k(x_i, \cdot)$ by a linear combination of $k(x_1, \cdot), \ldots, k(x_n, \cdot)$.
Without loss of generality we assume that $x_1, \ldots, x_n$ are the first $n$ patterns of the
set $\{x_1, \ldots, x_m\}$. We approximate³

$$k(x_i, \cdot) \approx \tilde{k}_i(\cdot) := \sum_{j=1}^{n} \alpha_{ij}\, k(x_j, \cdot). \tag{10.27}$$

As an approximation criterion we choose proximity in the Reproducing Kernel
Hilbert Space $\mathcal{H}$, hence we choose $\alpha_{ij}$ such that the approximation error

$$\Big\| k(x_i, \cdot) - \sum_{j=1}^{n} \alpha_{ij}\, k(x_j, \cdot) \Big\|_{\mathcal{H}}^{2} = k(x_i, x_i) - 2 \sum_{j=1}^{n} \alpha_{ij}\, k(x_i, x_j) + \sum_{j,j'=1}^{n} \alpha_{ij}\alpha_{ij'}\, k(x_j, x_{j'})$$


is minimized. An alternative would be to approximate the values of $k(x_i, \cdot)$ on
$X$ directly. The computational cost of doing the latter is much higher however
(see Problem 10.4 and [514] for details). Since we wish to optimize the overall
approximation quality we have to minimize the total approximation error for all $i$,
giving

$$\mathrm{Err}(\alpha) := \sum_{i=1}^{m} \left\| k(x_i, \cdot) - \tilde{k}_i(\cdot) \right\|_{\mathcal{H}}^{2} = \sum_{i=1}^{m} \Big[ K_{ii} - 2 \sum_{j=1}^{n} \alpha_{ij} K_{ij} + \sum_{j,j'=1}^{n} \alpha_{ij}\alpha_{ij'} K_{jj'} \Big]. \tag{10.28}$$


2. This can be seen from the results in Table 14.1.
3. Likewise we could formalize the problem as one of approximating patterns mapped into
feature space; we approximate $\Phi(x_i)$ by $\tilde{\Phi}(x_i) := \sum_{j=1}^{n} \alpha_{ij}\Phi(x_j)$ and measure the goodness-of-fit
via $\|\Phi(x_i) - \tilde{\Phi}(x_i)\|^2$. For a streamlined notation and to emphasize the fact that we are
approximating a function space we will, however, use the RKHS notation.

Here we use (as before) $K_{ij} := k(x_i, x_j)$. Since $\mathrm{Err}(\alpha)$ is a convex quadratic function
in $\alpha$, all we must do is set the first derivative of $\mathrm{Err}(\alpha)$ to zero in order to find
the minimizer of $\mathrm{Err}(\alpha)$. Note that in the present case $\alpha \in \mathbb{R}^{m \times n}$ is a matrix and
therefore, with some abuse of notation, we will use $\partial_\alpha$ to denote the derivative
with respect to all matrix elements. The minimizer of (10.28) satisfies

$$\partial_\alpha \mathrm{Err}(\alpha) = -2K^{mn} + 2\alpha K^{nn} = 0. \tag{10.29}$$

Here $K^{mn}$ is an $m \times n$ matrix with $K^{mn}_{ij} = K_{ij}$, so $K^{mn}$ is the left sub-matrix of $K$.
Likewise $K^{nn} \in \mathbb{R}^{n \times n}$ is the upper left sub-matrix of $K$. This leads to

$$\alpha_{\mathrm{opt}} = K^{mn}(K^{nn})^{-1}. \tag{10.30}$$


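We can exploit this property in order to determine the minimal approximation
error $\mathrm{Err}(\alpha_{\mathrm{opt}})$ and properties of the matrix $\tilde{K}$ where $\tilde{K}_{ij} = \langle \tilde{k}_i, \tilde{k}_j \rangle$. The following
theorem holds.


Theorem 10.2 (Properties of $\tilde{k}$ and $\tilde{K}$) With the above definitions and (10.30), the
matrices $K$, $\tilde{K}$, and $K - \tilde{K}$ are positive definite and

$$\mathrm{Err}(\alpha_{\mathrm{opt}}) = \operatorname{tr} K - \operatorname{tr} \tilde{K}, \tag{10.31}$$

where $\tilde{K} = K^{mn}(K^{nn})^{-1}(K^{mn})^{\top} = \alpha (K^{mn})^{\top} = \alpha K^{nn} \alpha^{\top}$.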


This means that we have an approximation of $K$ in such a way that $\tilde{K}$ is strictly
smaller than $K$ (since $K - \tilde{K}$ is positive definite as well) and, furthermore, that the
approximation error in terms of $\mathcal{H}$ can be computed cheaply by finding $\operatorname{tr} K - \operatorname{tr} \tilde{K}$,
that is by looking at only $m$ elements of the $m \times m$ matrices $K$ and $\tilde{K}$.

Finally, we obtain a bound on the operator norm of $K - \tilde{K}$ by computing the
trace of the difference, since the trace is the sum of all eigenvalues, and the latter
are nonnegative for positive matrices. In particular, for positive definite matrices
$K$ (and their eigenvalues $\lambda_i$) we have

$$\|K\| = \max_i \lambda_i \le \|K\|_{\mathrm{Frob}} = \left(\operatorname{tr} K K^{\top}\right)^{1/2} = \Big(\sum_{i=1}^{m} \lambda_i^2\Big)^{1/2} \le \operatorname{tr} K = \sum_{i=1}^{m} \lambda_i. \tag{10.32}$$

Note that the Frobenius norm $\|K\|_{\mathrm{Frob}}$ is simply the 2-norm of all coefficients of $K$.
(For symmetric matrices we may decompose $K$ into its eigensystem via $K = U^{\top}\Lambda U$
where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. This allows us to write
$\operatorname{tr} K K^{\top} = \operatorname{tr} U^{\top}\Lambda U U^{\top}\Lambda U = \operatorname{tr} \Lambda^2$.)


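Proof We prove the functional form of $\tilde{K}$ first. By construction we have

$$\tilde{K}_{ij} = \langle \tilde{k}_i, \tilde{k}_j \rangle = \sum_{l,l'=1}^{n} \alpha_{il}\alpha_{jl'} \langle k(x_l, \cdot), k(x_{l'}, \cdot) \rangle = \sum_{l,l'=1}^{n} \alpha_{il}\alpha_{jl'} K_{ll'} \tag{10.33}$$

and, therefore, by construction of $\alpha$,

$$\tilde{K} = \alpha K^{nn} \alpha^{\top} = K^{mn}(K^{nn})^{-1}K^{nn}(K^{nn})^{-1}(K^{mn})^{\top} = K^{mn}(K^{nn})^{-1}(K^{mn})^{\top}.$$

Next note that for optimal $\alpha$ we have (with $k_i$ as a shorthand for the vector
$(K_{i1}, \ldots, K_{in})$)

$$\begin{aligned}
\left\| k(x_i, \cdot) - \tilde{k}_i(\cdot) \right\|_{\mathcal{H}}^{2} &= K_{ii} - 2 k_i^{\top}(K^{nn})^{-1}k_i + k_i^{\top}(K^{nn})^{-1}K^{nn}(K^{nn})^{-1}k_i && (10.34)\\
&= K_{ii} - k_i^{\top}(K^{nn})^{-1}k_i = K_{ii} - \tilde{K}_{ii}. && (10.35)
\end{aligned}$$

Summation over $i$ proves the first part of (10.31). What remains is to prove positive
definiteness of $K$, $\tilde{K}$, and $K - \tilde{K}$. As Gram matrices both $K$ and $\tilde{K}$ are positive
definite. To prove that $K - \tilde{K}$ is also positive definite we show that

$$K - \tilde{K} = \bar{K} \quad\text{where}\quad \bar{K}_{ij} := \left\langle k(x_i, \cdot) - \tilde{k}_i(\cdot),\; k(x_j, \cdot) - \tilde{k}_j(\cdot) \right\rangle. \tag{10.36}$$

All we have to do is substitute the optimal value of $\alpha$, i.e., $\alpha = K^{mn}(K^{nn})^{-1}$, into the
definition of $\bar{K}$ to obtain

$$\bar{K} = K - 2K^{mn}\alpha^{\top} + \alpha K^{nn}\alpha^{\top} = K - K^{mn}(K^{nn})^{-1}(K^{mn})^{\top} = K - \tilde{K}. \tag{10.37}$$

∎

Note that (10.36) also means that $\langle k(x_i, \cdot) - \tilde{k}_i(\cdot), k(x_j, \cdot) - \tilde{k}_j(\cdot) \rangle = \langle k(x_i, \cdot), k(x_j, \cdot) \rangle - \langle \tilde{k}_i(\cdot), \tilde{k}_j(\cdot) \rangle$.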
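The statements of Theorem 10.2 are easy to check numerically. The sketch below assumes a Gaussian RBF kernel on random two-dimensional data and simply takes the first $n$ patterns as basis functions; the data, the kernel width, and all variable names are illustrative choices rather than part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.5)                       # full m x m kernel matrix

n = 10
K_mn, K_nn = K[:, :n], K[:n, :n]
alpha = np.linalg.solve(K_nn, K_mn.T).T     # alpha = K_mn K_nn^{-1}, cf. (10.30)
K_tilde = alpha @ K_mn.T                    # K~ = K_mn K_nn^{-1} K_mn^T

# Residual evaluated from (10.28) and via the trace formula (10.31).
err_direct = np.trace(K) - 2 * np.sum(alpha * K_mn) + np.sum((alpha @ K_nn) * alpha)
err_trace = np.trace(K) - np.trace(K_tilde)
min_eig = np.linalg.eigvalsh(K - K_tilde).min()

print(err_direct, err_trace, min_eig)
```

Both ways of computing the residual should agree up to rounding, and the smallest eigenvalue of $K - \tilde{K}$ should be nonnegative up to numerical precision.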


10.2.2 Iterative Methods and Random Sets 


While (10.31) can tell us how well we are able to approximate a set of m kernel 
functions k(x;,-) by a subset of size n, it cannot be used to predict how large 
n should be. Let us first assume that we choose to pick the subset of kernel 
functions k(x;, -) at random to approximate the full set, as suggested in the context 
of Gaussian Processes [603]. We will present a more efficient method of choosing 
the kernel functions in the next section but, for the moment, assume that the 
selection process is completely random. 

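Assume that, for some $n$, we have already computed the values of $(K^{nn})^{-1}$ and
$\alpha_{\mathrm{opt}} = K^{mn}(K^{nn})^{-1}$. For an additional kernel function, say $k(x_{n+1}, \cdot)$, we need to
find a way to compute the values of $\alpha_{\mathrm{opt}}$ and $(K^{n+1,n+1})^{-1}$ efficiently (since the
difference between $K^{nn}$ and $K^{n+1,n+1}$ is only of rank 1 such a change is commonly
referred to as a rank-1 update). We may either do this directly or use a Cholesky
decomposition for increased numerical stability. For details on the latter strategy
see [423, 530, 247] and Problem 10.5.

Denote by $k \in \mathbb{R}^n$ the upper right vector $(K_{n+1,1}, \ldots, K_{n+1,n})$ of the matrix
$K^{n+1,n+1} \in \mathbb{R}^{(n+1) \times (n+1)}$ to be inverted, and $\kappa := K_{n+1,n+1}$. Then we have

$$\left(K^{n+1,n+1}\right)^{-1} = \begin{bmatrix} K^{nn} & k \\ k^{\top} & \kappa \end{bmatrix}^{-1} = \begin{bmatrix} (K^{nn})^{-1} + \eta^{-1} v v^{\top} & -\eta^{-1} v \\ -\eta^{-1} v^{\top} & \eta^{-1} \end{bmatrix} \tag{10.38}$$

where $\eta = \kappa - k^{\top}(K^{nn})^{-1}k$ and $v = (K^{nn})^{-1}k$. This means that computing
$(K^{n+1,n+1})^{-1}$ costs $O(n^2)$ operations once $(K^{nn})^{-1}$ is known. Next we must update
$\alpha$. Splitting $K^{m,n+1}$ into $K^{mn}$ and $\bar{k} = (K_{1,n+1}, \ldots, K_{m,n+1})$ yields

$$\begin{aligned}
\alpha^{m,n+1} &= K^{m,n+1}\left(K^{n+1,n+1}\right)^{-1} && (10.39)\\
&= \left[ K^{mn}(K^{nn})^{-1} + \eta^{-1}\left[K^{mn}v - \bar{k}\right] v^{\top},\; -\eta^{-1}\left[K^{mn}v - \bar{k}\right] \right]. && (10.40)
\end{aligned}$$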

Computing (10.39) involves $O(mn)$ operations since $K^{mn}(K^{nn})^{-1}$ is the old value of
$\alpha$ for the case of $n$ basis functions. The most expensive part of the procedure is
computing $K^{mn}v$, which requires only $O(mn)$ operations, and the approximation error
$\mathrm{Err}(\alpha)$ can be computed efficiently. It is given by

$$\mathrm{Err}(\alpha) = \operatorname{tr} K - \operatorname{tr} \alpha \left(K^{m,n+1}\right)^{\top} = \operatorname{tr} K - \sum_{i=1}^{m} \sum_{j=1}^{n+1} \alpha_{ij} K_{ij}. \tag{10.41}$$

Since $\tilde{K} = K^{m,n+1}\alpha^{\top}$ we only have to account for the changes in $\alpha$ and the additional
row due to $K^{m,n+1}$. Without going into further details one can check that
(10.41) can be computed in $O(m)$ time, provided the value for the previous $n$ is already
known.

Overall, successive applications of a rank-1 update method to compute a sparse
approximation using $n$ kernel functions to approximate a set of $m$ incurs a computational
cost of $O\big(\sum_{n'=1}^{n} m n'\big) = O(mn^2)$. One can see that this is only a constant
factor worse than a direct calculation. Besides that, the memory footprint of the
algorithm is also only $O(mn)$ which can be significantly smaller than storage of
the matrix $K$, namely $O(m^2)$.
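The rank-1 update of (10.38) to (10.40) can be written down directly. The following sketch uses the explicit inverse rather than the numerically more stable Cholesky variant mentioned above; the function name `add_basis_function` and the test data are hypothetical.

```python
import numpy as np

def add_basis_function(Kinv, alpha, K_mn, k, kbar, kappa):
    """Rank-1 update following (10.38)-(10.40).

    Kinv  : (n, n) inverse of K^{nn}
    alpha : (m, n) current coefficients K^{mn} (K^{nn})^{-1}
    K_mn  : (m, n) kernel values between all patterns and the old basis points
    k     : (n,)  kernel values between the new basis point and the old ones
    kbar  : (m,)  kernel values between the new basis point and all patterns
    kappa : k(x_new, x_new)
    """
    v = Kinv @ k
    eta = kappa - k @ v
    Kinv_new = np.block([[Kinv + np.outer(v, v) / eta, -v[:, None] / eta],
                         [-v[None, :] / eta, np.array([[1.0 / eta]])]])
    w = (K_mn @ v - kbar) / eta              # eta^{-1} [K^{mn} v - kbar]
    alpha_new = np.hstack([alpha + np.outer(w, v), -w[:, None]])
    return Kinv_new, alpha_new

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 0.5)
n = 5
Kinv = np.linalg.inv(K[:n, :n])
alpha = K[:, :n] @ Kinv
Kinv_new, alpha_new = add_basis_function(Kinv, alpha, K[:, :n], K[:n, n], K[:, n], K[n, n])
```

Since $K^{mn}v = \alpha k$ for the old value of $\alpha$, an implementation that does not keep $K^{mn}$ in memory could compute `alpha @ k` instead of `K_mn @ v`.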

Note the similarity in computational cost between this iterative method and
Conjugate Gradient Descent methods (Section 6.2.4) where the inverse of $K$ was
effectively constructed by building up an $n$-dimensional subspace of conjugate
directions. The difference is that we never actually need to compute the full matrix
$K$.⁴


10.2.3 Optimal and Greedy Selections 


We showed that the problem of finding good coefficients $\alpha$ can be solved efficiently
once a set of basis functions $k(x_1, \cdot), \ldots, k(x_n, \cdot)$ is available. The problem
of selecting a good subset is the more difficult issue. One can show that even relatively
simple problems such as one-target optimal approximation are NP-hard
[381]; we cannot expect to find a solution in polynomial time.

We can take a greedy approach in the spirit of [381, 474] (see Section 6.5.3),
with the difference being that we are not approximating one single target function
but a set of $m$ basis functions. This means that, rather than picking the functions
at random, we choose one function at a time depending on which of them will
decrease $\mathrm{Err}(\alpha)$ most and add this function to the set of kernels $I$ chosen so far.
Next we recompute the residual $\mathrm{Err}(\alpha)$ and continue iterating.


4. Strictly speaking, it is also the case that conjugate gradient descent does not require
computation of $K$ but only of $K\alpha$ for $\alpha \in \mathbb{R}^m$.


[Figure 10.2: plot of the residual against the Number of Basis Functions (0 to 450); legend: subset size 1, 2, 5, 10, 20, 50, 59, 100, 200, 500.]

Figure 10.2 Size of the residuals $\log \operatorname{tr}(K - \tilde{K})$ for the Abalone dataset from the
UCI repository [56]. From top to bottom: subsets of size 1, 2, 5, 10, 20, 50, 59, 100, 200.
Note that, for subsets of size 50 or more, no noticeable difference in performance can
be observed. After rescaling each input individually to zero mean and unit variance
we used a Gaussian kernel (2.68) with $2\sigma^2 = N = 13$ where $N$ is the dimension
of the data. The size of the overall matrix was $m = 3000$.

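It would be wasteful to compute a full update for $\alpha$ and $(K^{n+1,n+1})^{-1}$ for every
possible candidate since all we are interested in is the change in $\mathrm{Err}(\alpha)$. Simple
(but tedious) algebra yields that, with the definitions $k$, $v$, and $\bar{k}$ of Section 10.2.2,

$$\begin{aligned}
\mathrm{Err}(\alpha^{m,n+1}) &= \operatorname{tr} K - \operatorname{tr} \alpha^{m,n+1}\left(K^{m,n+1}\right)^{\top}\\
&= \operatorname{tr} K - \operatorname{tr} \left[ K^{mn}(K^{nn})^{-1} + \eta^{-1}\left[K^{mn}v - \bar{k}\right]v^{\top},\; -\eta^{-1}\left[K^{mn}v - \bar{k}\right] \right] \left[ K^{mn},\; \bar{k} \right]^{\top}\\
&= \mathrm{Err}(\alpha^{mn}) - \eta^{-1}\left\| K^{mn}v - \bar{k} \right\|^2.
\end{aligned} \tag{10.42}$$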


Hence, our selection criterion is to find that function $k(x_i, \cdot)$ for which the decrease
in the approximation error $\eta^{-1}\|K^{mn}v - \bar{k}\|^2$ is largest. This method still has a
downside: at every step we would have to test the $m - n$ different remaining
kernel functions $k(x_i, \cdot)$ to find which would yield the largest improvement. With
this the cost per iteration would be $O(mn(m - n))$ which is clearly infeasible.

The key trick is not to analyze the complete set of $m - n$ functions but to pick a
random subset instead. Section 6.5.1 and in particular Theorem 6.33 tell us that a
random subset of size $N = 59$ is sufficiently large to yield a function $k(x_i, \cdot)$ which
is, with 95% confidence, better than 95% of all other kernel functions.
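The number 59 can be reproduced with a one-line calculation. The sketch assumes that candidates are drawn independently and uniformly, so that the best of $N$ draws fails to lie among the top 5% with probability $0.95^N$; the function name `subset_size` is only illustrative.

```python
import math

def subset_size(quantile=0.95, confidence=0.95):
    """Smallest N such that 1 - quantile**N >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(quantile))

N = subset_size()
print(N, 1.0 - 0.95 ** N)
```

Note that the required subset size depends only on the desired quantile and confidence level, not on the number $m - n$ of remaining candidates.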

Figure 10.2 shows that, in practice, subsets of 59 yield results almost as good as
when a much larger set or even the complete dataset is used to find the next basis
function. Note the rapid decay in $\ln \mathrm{Err}(\alpha) = \ln \operatorname{tr}(K - \tilde{K})$.

We conclude this section with the pseudocode (Algorithm 10.1) needed to find
a sparse greedy approximation of $K$ and $k(x_i, \cdot)$ in the Reproducing Kernel Hilbert
Space $\mathcal{H}$. Note that the cost of the algorithm is now $O(Nmn^2)$, hence it is of the
same order as an algorithm choosing kernel functions at random.⁵


Algorithm 10.1 Sparse Greedy Matrix Approximation (SGMA)

    input basis functions $k_i$, bound on residuals $\epsilon$
    $n = 0$, $I = \{\}$, $\alpha = [\,]$, $K^{nn} = 0$
    repeat
        n++
        Draw random subset $M$ of size $N$ from $[m] \setminus I$
        {Select best basis function}
        for all $j \in M$ do
            $v = (K^{nn})^{-1} k$
            $\eta = \kappa - v^{\top} k$
            $\mathrm{Err}(\alpha^{m,n+1}) = \mathrm{Err}(\alpha^{mn}) - \eta^{-1}\|K^{mn}v - \bar{k}\|^2$
        end for
        Select best $\hat{\jmath} \in M$ and update $I = I \cup \{\hat{\jmath}\}$.
        Update $(K^{n+1,n+1})^{-1}$ and $\alpha^{m,n+1}$ from (10.38) and (10.39).
        Update $\mathrm{Err}(\alpha^{m,n+1})$
    until $\mathrm{Err}(\alpha^{m,n+1}) \le \epsilon$
    output $n$, $\alpha$, $I$, $\mathrm{Err}(\alpha^{m,n+1})$
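For concreteness, the sparse greedy procedure can be sketched in NumPy as follows. This is an illustrative implementation under simplifying assumptions, not a reference one: the full matrix $K$ is passed in for convenience (only the selected and candidate columns are read), the explicit inverse is updated instead of a Cholesky factor, and the function name `sgma`, the tolerance `1e-12`, and the test data are hypothetical.

```python
import numpy as np

def sgma(K, eps, N=59, seed=0):
    """Sparse greedy matrix approximation of a kernel matrix K.

    Returns the selected indices I, the coefficients alpha (m x n) and the
    residual Err(alpha) = tr K - tr K~.
    """
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    diag = np.diag(K).copy()
    I, err = [], float(diag.sum())             # Err = tr K for the empty set
    Kinv, alpha = np.zeros((0, 0)), np.zeros((m, 0))
    while err > eps and len(I) < m:
        idx = np.array(I, dtype=int)
        rest = np.setdiff1d(np.arange(m), idx)
        M = rng.choice(rest, size=min(N, rest.size), replace=False)
        best = None
        for j in M:                             # score candidates via (10.42)
            kbar = K[:, j]
            k = kbar[idx]
            v = Kinv @ k
            eta = diag[j] - k @ v
            if eta <= 1e-12:                    # candidate already in the span
                continue
            w = alpha @ k - kbar                # equals K^{mn} v - kbar
            gain = (w @ w) / eta
            if best is None or gain > best[0]:
                best = (gain, j, v, eta, w)
        if best is None:
            break
        gain, j, v, eta, w = best
        n = len(I)
        Kinv_new = np.empty((n + 1, n + 1))     # block inverse (10.38)
        Kinv_new[:n, :n] = Kinv + np.outer(v, v) / eta
        Kinv_new[:n, n] = Kinv_new[n, :n] = -v / eta
        Kinv_new[n, n] = 1.0 / eta
        alpha = np.hstack([alpha + np.outer(w / eta, v), -(w / eta)[:, None]])
        Kinv, err = Kinv_new, err - gain
        I.append(int(j))
    return I, alpha, err

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
I, alpha, err = sgma(K, eps=0.05 * np.trace(K))
print(len(I), err)
```

Scoring a candidate costs $O(mn)$ operations because $K^{mn}v$ is evaluated as $\alpha k$; the full update of $\alpha$ and of the inverse is carried out only for the selected function.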
10.2.4 Experiments 


To illustrate the performance of Sparse Greedy Matrix Approximation (SGMA) we
compare it with a conventional low-rank approximation method, namely PCA.
The latter is optimal in terms of finite dimensional approximations (see Problem 10.7);
however, it comes at the expense of requiring the full set of basis functions.
We show in experiments that the approximation rates when SGMA is used
are not much worse than those obtained with PCA. Experimental evidence also
shows that the generalization performance of the two methods is very similar (see
[513] for more details). Figure 10.3 shows that under various different choices of a
Hilbert space (we varied the kernel width $\sigma$) the approximation quality obtained
from SGMA closely resembles that of PCA.

Since SGMA picks individual basis functions $k(x_i, \cdot)$ with corresponding observations
$x_i$ we may ask whether the so-chosen $x_i$ are special in some way. Figure
10.4 shows the first observations for the USPS dataset of handwritten digits (Gaussian
RBF kernels with width $2\sigma^2 = 0.5 \cdot N$ where $N = 16 \times 16$ pixels). Note that
among the first 15 observations (and corresponding basis functions) chosen on the
USPS database, all 10 digits appear at least once. The pair of ones is due to a horizontal
shift of the two images with respect to each other. This makes them almost
orthogonal to each other in feature space: their bitmaps hardly overlap at all.


5. If the speed of prediction is not of utmost importance, random subset selection instead
of the ‘59-trick’ may just be good enough, since it is $N = 59$ times less expensive per basis
function but will typically use only four times as many basis functions. With a run-time
which is quadratic in the number $n$ of basis functions this may lead to an effective speed
up, at the expense of a larger memory footprint.

It is still an open question how and whether the good approximation qualities
shown in Figure 10.3 can be guaranteed theoretically (3 orders of magnitude
approximation with fewer than 10% of the basis functions). This need not be of
any concern for the practitioner since he or she can always easily observe when
the algorithm works (yet a theoretical guarantee would be nice to have).

In practice, generalization performance is more important than the question
of whether the initial class of functions can be approximated well enough. As
experimental evidence shows [514] the size of $\operatorname{tr}(K - \tilde{K})$, that is, the residual
error, is conservative in determining the performance of a reduced rank estimator.
For modest values of approximation, such as 2 orders of magnitude reduction in
$\operatorname{tr}(K - \tilde{K})$, the performance is as good as the one of the estimator obtained without
approximations. In some cases, such a sparse approximation may provide better
performance.


A practical use of SGMA lies in the fact that it allows us to find increasingly
accurate, yet sparse approximations $\tilde{K}$ which can subsequently be used in optimization
algorithms as replacements for $K$. It is also worth noting that the approximation
and training algorithms can be coupled to obtain fast methods that
do not require computation of $K$. Many methods find a dense expansion first and
only subsequently employ a reduced set algorithm (Chapter 18) to find a more
compact representation.


10.3 Interior Point Algorithms 


Interior point algorithms are some of the most reliable and accurate optimization 
techniques and are the method of choice for small and moderately sized problems. 
We will discuss approximations applicable to large-scale problems in Sections 
10.4, 10.5 and 10.6. We assume that the reader is familiar with the basic notions 
of interior point algorithms. Details can be found in Section 6.4 and references 
therein. In this section we focus on Support Vector specific details. 

In order to deal with optimization problems which have both equality constraints
and box constrained variables, we need to extend the notation of (6.72)
slightly. The following optimization problem is general enough to cover classification,
regression, and novelty detection:

$$\begin{array}{ll}
\underset{\alpha,\, t}{\text{minimize}} & \tfrac{1}{2}\alpha^{\top} Q \alpha + c^{\top}\alpha\\[2pt]
\text{subject to} & A\alpha = d\\
& \{0 \le \alpha \le u\} \text{ or } \{\alpha + t = u \text{ and } \alpha, t \ge 0\}.
\end{array} \tag{10.43}$$

Here $Q$ is a square matrix, typically of size $m \times m$ or $(2m) \times (2m)$, $c, \alpha, t, u$ are
vectors of the same dimension, and $A$ is a corresponding rectangular matrix. The
dual can be found to be

$$\begin{array}{ll}
\underset{\alpha,\, h,\, s,\, z}{\text{maximize}} & -\tfrac{1}{2}\alpha^{\top} Q \alpha + d^{\top}h - u^{\top}s\\[2pt]
\text{subject to} & Q\alpha + c - A^{\top}h + s - z = 0\\
& s, z \ge 0 \text{ and } h \text{ free.}
\end{array} \tag{10.44}$$

Furthermore, we have the Karush-Kuhn-Tucker (KKT) conditions

$$\alpha_i z_i = 0 \quad\text{and}\quad s_i t_i = 0 \quad\text{for all } i \in [m] \tag{10.45}$$

which are satisfied at optimality.

Note that besides the primal variables $\alpha$ and $t$ of the optimization problem
(10.43), the dual variables, such as $s$, $z$, and $h$, also carry information. Recall the
reasoning of Section 6.3.3, where we showed that the dual-dual of a linear program
is the original linear program. One may exploit this connection in the context of
(10.43) and (10.44), since the latter is similar to the initial problem of minimizing
regularized risks. In particular, the constraint $A\alpha = d$ stems from the free variables (such as $b$
or the semiparametric coefficients $\beta_i$). Consequently the dual variables $h$ of (10.44)
agree with $b$ (or the semiparametric coefficients $\beta_i$). In practice this means that
we need not perform any additional calculation in order to obtain $b$ if we use an
interior point code.


10.3.1 Solving the Equations 
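As in Section 6.4.2 we solve the optimization problem by simultaneously satisfying
primal and dual feasibility conditions and a relaxed version of (10.45). Linearization
by analogy with (6.82) leads to

$$\begin{aligned}
A\Delta\alpha &= d - A\alpha &&=: \rho_{p1}\\
\Delta\alpha + \Delta t &= u - \alpha - t &&=: \rho_{p2}\\
Q\Delta\alpha - A^{\top}\Delta h + \Delta s - \Delta z &= -Q\alpha - c + A^{\top}h - s + z &&=: \rho_{d}\\
\alpha_i^{-1} z_i \Delta\alpha_i + \Delta z_i &= \mu\alpha_i^{-1} - z_i - \alpha_i^{-1}\Delta\alpha_i\Delta z_i &&=: \rho_{\mathrm{KKT1}}\\
t_i^{-1} s_i \Delta t_i + \Delta s_i &= \mu t_i^{-1} - s_i - t_i^{-1}\Delta t_i\Delta s_i &&=: \rho_{\mathrm{KKT2}}
\end{aligned} \tag{10.46}$$

Solving for $\Delta t$, $\Delta z$, $\Delta s$ yields

$$\begin{aligned}
\Delta t &= \rho_{p2} - \Delta\alpha\\
\Delta z_i &= \rho_{\mathrm{KKT1}} - \alpha_i^{-1} z_i \Delta\alpha_i\\
\Delta s_i &= \rho_{\mathrm{KKT2}} + t_i^{-1} s_i \Delta\alpha_i
\end{aligned} \tag{10.47}$$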


and the reduced KKT system (see (6.83))

$$\begin{bmatrix} Q + \operatorname{diag}(t^{-1}s + \alpha^{-1}z) & -A^{\top} \\ -A & 0 \end{bmatrix} \begin{bmatrix} \Delta\alpha \\ \Delta h \end{bmatrix} = \begin{bmatrix} \rho_d + \rho_{\mathrm{KKT1}} - \rho_{\mathrm{KKT2}} \\ -\rho_{p1} \end{bmatrix}. \tag{10.48}$$

Eq. (10.48) is best solved by a standard Cholesky decomposition of the upper
left submatrix and explicit solution for the remaining parts of the linear system.⁶
For details of the predictor and corrector iteration strategies, the updates of $\mu$,
convergence monitoring via the KKT gap and regarding initial conditions we refer
the reader to Section 6.4.


6. Pseudocode for a Cholesky decomposition can be found in most numerical analysis
textbooks such as [423]. If $(Q + \operatorname{diag}(t^{-1}s + \alpha^{-1}z))$ should happen to be ill-conditioned, as
may occur in rare cases during the iterations, then it is recommended to use the
pseudo-inverse or the Bunch-Kaufman decomposition [83] as a fall-back option. Linear algebra
libraries such as LAPACK typically contain optimized versions of these algorithms.
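The solution strategy for the reduced KKT system, a Cholesky factorization of the upper left block followed by an explicit solve for the few remaining variables, can be sketched as follows. The function name `solve_reduced_kkt` and the random test problem are hypothetical; the sketch writes the system as $M\Delta\alpha - A^{\top}\Delta h = r$ and $A\Delta\alpha = \rho_{p1}$, where $M$ stands for $Q + \operatorname{diag}(t^{-1}s + \alpha^{-1}z)$ and $r$ for the first right-hand side block.

```python
import numpy as np

def solve_reduced_kkt(M, A, r, rho_p1):
    """Solve  M da - A^T dh = r,  A da = rho_p1  for (da, dh).

    M is the symmetric positive definite block Q + diag(t^{-1} s + alpha^{-1} z).
    """
    L = np.linalg.cholesky(M)                         # M = L L^T

    def M_solve(B):
        # A dedicated triangular solver would exploit the structure of L;
        # np.linalg.solve keeps this sketch free of further dependencies.
        return np.linalg.solve(L.T, np.linalg.solve(L, B))

    Minv_r, Minv_At = M_solve(r), M_solve(A.T)
    S = A @ Minv_At                                   # small Schur complement
    dh = np.linalg.solve(S, rho_p1 - A @ Minv_r)
    da = Minv_r + Minv_At @ dh
    return da, dh

rng = np.random.default_rng(0)
m = 6
B = rng.normal(size=(m, m))
M = B @ B.T + np.diag(rng.uniform(1.0, 2.0, size=m))  # stands in for Q + diag(...)
A = np.sign(rng.normal(size=(1, m)))                  # one equality constraint
r, rho_p1 = rng.normal(size=m), rng.normal(size=1)
da, dh = solve_reduced_kkt(M, A, r, rho_p1)
```

Since $A$ has only one or two rows in the settings considered here, the Schur complement `S` is tiny and the cost is dominated by the factorization of the upper left block.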


10.3.2 Special Considerations for Classification 


We now consider the particular case of SV classification and assign values to
$Q$, $c$, $A$, and $u$. For standard SV classification we obtain

$$\begin{aligned}
Q_{ij} &= y_i y_j k(x_i, x_j) && \text{where } Q \in \mathbb{R}^{m \times m}\\
c &= (1, \ldots, 1) && \text{where } c \in \mathbb{R}^{m}\\
u &= (C, \ldots, C) = \left(\tfrac{1}{\lambda m}, \ldots, \tfrac{1}{\lambda m}\right) && \text{where } u \in \mathbb{R}^{m}\\
A &= (y_1, \ldots, y_m) && \text{where } A \in \mathbb{R}^{m}\\
d &= 0 && \text{where } d \in \mathbb{R}
\end{aligned} \tag{10.49}$$

For $\nu$-classification (see Section 7.2), the parameters $A$ and $d$ are changed to

$$A = \begin{bmatrix} y_1 & \cdots & y_m \\ 1 & \cdots & 1 \end{bmatrix}, \qquad d \in \mathbb{R}^2, \tag{10.50}$$

that is, a second equality constraint is added.

In addition, we now can give meaning to the variables $t$, $z$, and $s$. For each $\alpha_i$ there
exists a dual variable $z_i$ for which $\alpha_i z_i \to 0$ as the iteration advances, due to the
KKT conditions. This means that either $z_i \to 0$ or $\alpha_i \to 0$. The practical consequence
of this is that we can eliminate $\alpha_i$ before the algorithm converges completely.
All we need do is look at the size of the entries $\alpha_i^{-1} z_i$. If they exceed a certain
threshold, say $c$, we eliminate the corresponding set of variables $(t_i, z_i, s_i, \alpha_i)$ from
the optimization process, since the point will not be a Support Vector (see also
Section 10.4 for details on how the optimization problem changes).

The advantage is twofold. First, this reduces computational requirements since
the size of the matrix to be inverted decreases. Second, the condition of the remaining
reduced KKT system improves (large entries on the main diagonal can
worsen the condition significantly) which allows for faster convergence. A similar
reasoning can be applied to $t_i$ and $s_i$. Again, if $t_i^{-1} s_i \to \infty$ this indicates that the
corresponding pattern will be a Support Vector with, however, the coefficient $\alpha_i$
hitting the upper boundary $u_i$. Elimination of this variable is slightly more complicated
since we have to account for it in the equality constraints $A\alpha = d$ and update
$d$ accordingly.

As far as computational efficiency is concerned, plain interior point codes without
any further modifications are a good choice for data sets of size up to $10^3$ to $10^4$
since they are simple to implement, reliable in convergence, and very precise (the
size of the KKT gap is several orders of magnitude smaller than what could be
achieved by SMO or other chunking algorithms). Fast numerical mathematics
packages such as BLAS [316, 145] and LAPACK [11] are crucial in this case though.
Beyond that, the memory footprint and the computational cost of performing a
Cholesky decomposition of $Q$ are too expensive. We will show in Section 10.3.4
how to overcome this restriction.


10.3.3 Special Considerations for SV Regression 


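In the generic case we have

$$\begin{aligned}
Q &= \begin{bmatrix} K & -K \\ -K & K \end{bmatrix} && \text{where } K \in \mathbb{R}^{m \times m}\\
c &= (y_1 - \epsilon, \ldots, y_m - \epsilon, -y_1 - \epsilon, \ldots, -y_m - \epsilon) && \text{where } c \in \mathbb{R}^{2m}\\
u &= (C, \ldots, C) = \left(\tfrac{1}{\lambda m}, \ldots, \tfrac{1}{\lambda m}\right) && \text{where } u \in \mathbb{R}^{2m}\\
A &= (1, \ldots, 1, -1, \ldots, -1) && \text{where } A \in \mathbb{R}^{2m}\\
d &= 0
\end{aligned} \tag{10.51}$$

The constraint matrix $A$ changes analogously to (10.49) if we use $\nu$-regression, that
is, we have two constraints rather than one. Another minor modification allows
us to also deal with Huber’s robust regression. All we need do is define $Q$ as
$Q = \begin{bmatrix} K + D & -K \\ -K & K + D' \end{bmatrix}$, where $D$ and $D'$ are positive definite diagonal matrices. Since the
reduced KKT system only modifies $Q$ by adding diagonal terms, we can solve
both Huber’s robust regression and the generic case using identical code.

The key trick in inverting $Q$ and the matrices derived from $Q$ by addition to
the main diagonal is to exploit the redundancy of its off-diagonal elements. By an
orthogonal transform

$$O = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \tag{10.52}$$

one obtains

$$O^{\top} Q\, O = \begin{bmatrix} 2K + \frac{D + D'}{2} & \frac{D - D'}{2} \\[2pt] \frac{D - D'}{2} & \frac{D + D'}{2} \end{bmatrix}. \tag{10.53}$$

Therefore, the $O^{\top}QO$ system can be inverted essentially by inverting an $m \times m$
matrix instead of a $2m \times 2m$ system. “Essentially,” since in addition to the inversion
of the $m \times m$ block built from $2K$ we must solve for the diagonal matrices $D \pm D'$. The latter is simply
an operation of computational cost $O(m)$. Furthermore, $Q^{-1} = O(O^{\top}QO)^{-1}O^{\top}$.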
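The effect of the orthogonal transform can be illustrated with a short NumPy sketch. It assumes the block form of $Q$ with diagonal matrices $D$ and $D'$ given above and eliminates the purely diagonal block of (10.53), so that only one $m \times m$ system has to be solved; the function name `solve_svr_system` and the random test data are hypothetical.

```python
import numpy as np

def solve_svr_system(K, d, d_star, b):
    """Solve [[K + D, -K], [-K, K + D']] x = b with a single m x m solve.

    d and d_star hold the (positive) diagonals of D and D'.
    """
    m = K.shape[0]
    b1, b2 = b[:m], b[m:]
    # Rotate the right-hand side: bt = O^T b with O = [[I, I], [-I, I]] / sqrt(2).
    bt1, bt2 = (b1 - b2) / np.sqrt(2), (b1 + b2) / np.sqrt(2)
    e, f = (d - d_star) / 2.0, (d + d_star) / 2.0    # diagonal blocks of (10.53)
    # Eliminate the second block, which only involves diagonal matrices (O(m)).
    S = 2.0 * K + np.diag(f - e * e / f)
    y1 = np.linalg.solve(S, bt1 - (e / f) * bt2)
    y2 = (bt2 - e * y1) / f
    # Rotate back: x = O y.
    return np.concatenate([(y1 + y2) / np.sqrt(2), (y2 - y1) / np.sqrt(2)])

rng = np.random.default_rng(0)
m = 5
B = rng.normal(size=(m, m))
K = B @ B.T
d, d_star = rng.uniform(0.5, 2.0, size=m), rng.uniform(0.5, 2.0, size=m)
b = rng.normal(size=2 * m)
x = solve_svr_system(K, d, d_star, b)
Q = np.block([[K + np.diag(d), -K], [-K, K + np.diag(d_star)]])
```

The diagonal correction `f - e * e / f` equals $2DD'/(D + D')$ entrywise, so the reduced $m \times m$ matrix stays positive definite whenever $D$ and $D'$ are.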

All other considerations are identical to those for classification. Hence we can
use the slightly more efficient matrix inversion with (10.53) as a drop-in replacement
of a more pedestrian approach. This is the additional advantage we can gain
from a direct implementation instead of using an off-the-shelf optimizer such as
LOQO [556] or CPLEX [117]; in general the latter will not be able to exploit the
special structure of $Q$.

Finally, note that by solving the primal and dual optimization problem simultaneously
we also compute parameters corresponding to the initial SV optimization
problem. This observation is useful as it allows us to obtain the constant term $b$
directly, namely by setting $b = h$. See Problem 6.13 for details.


10.3.4 Large Scale Problems 


The key stumbling block encountered in scaling interior point codes up to larger 
problems is that Q is of the size of m. Therefore, the cost of storing or of inverting Q 
(or any matrices derived from it) is prohibitively high. A possible solution to large 
problems is to use an efficient storage method for Q by low rank approximations. 


Linear Systems Assume for a moment that we are using a linear SV kernel only,
namely $k(x, x') = \langle x, x' \rangle$. In this case $K = XX^{\top}$ where $X \in \mathbb{R}^{m \times n}$ is (with slight abuse
of notation) the matrix of all the training patterns, that is $X_{ij} = (x_i)_j$. Ferris and
Munson [168] use this idea to find a more efficient solution for a linear SVM. They
employ the Sherman-Woodbury-Morrison formula [207] (see also Section 16.4.1
for further applications) which gives

$$\left(V + RHR^{\top}\right)^{-1} = V^{-1} - V^{-1}R\left(H^{-1} + R^{\top}V^{-1}R\right)^{-1}R^{\top}V^{-1}, \tag{10.54}$$

to invert $(Q + \operatorname{diag}(t^{-1}s + \alpha^{-1}z))$ efficiently (see Problem 10.10). In particular they
use out-of-core storage of the data (i.e. storage on the hard disk) in order to be able 
to deal with up to 10° samples. 
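
As an illustration of how (10.54) is used in the linear case, the following is a minimal sketch, assuming NumPy and a strictly positive diagonal d standing in for the diagonal terms added to K. The function name `smw_solve_linear` and its arguments are hypothetical and not taken from [168]; the sketch only shows that an m x m solve can be replaced by an n x n one.

```python
import numpy as np

def smw_solve_linear(X, d, rhs):
    """Solve (X X^T + diag(d)) v = rhs using only an n x n linear system.

    X   : (m, n) data matrix, so that K = X @ X.T (linear kernel)
    d   : (m,) strictly positive diagonal entries (assumed given)
    rhs : (m,) right-hand side
    """
    n = X.shape[1]
    Dinv_rhs = rhs / d                     # D^{-1} rhs, cost O(m)
    Dinv_X = X / d[:, None]                # D^{-1} X, cost O(mn)
    # (10.54) with V = D, R = X, H = identity: only this n x n matrix is factorized.
    small = np.eye(n) + X.T @ Dinv_X
    correction = Dinv_X @ np.linalg.solve(small, X.T @ Dinv_rhs)
    return Dinv_rhs - correction
```

The m x m matrix K is never formed; memory and time are dominated by the m x n matrix X, which is what makes out-of-core storage of the data practical.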


Nonlinear Systems For nonlinear SVMs, unfortunately, an application of the 
same technique is not directly feasible since storage requirements for K are enormous. 
But we can rely on the matrix approximation techniques of Section 10.2 
and approximate $K$ by $\tilde{K} = K^{mn}(K^{nn})^{-1}(K^{mn})^\top$. We only give the equations for SV 
classification where Q = K. The extension to regression is straightforward: all 
we must do is apply the orthogonal transform (10.53) beforehand. The problem is 
then solved in $O(mn^2)$ time since 


$$\left(K^{mn}(K^{nn})^{-1}(K^{mn})^\top + D\right)^{-1} = D^{-1} - D^{-1}K^{mn}\left(K^{nn} + (K^{mn})^\top D^{-1}K^{mn}\right)^{-1}(K^{mn})^\top D^{-1}. \tag{10.55}$$


The expensive part is the matrix multiplications with $K^{mn}$ and the storage of $K^{mn}$. 
As before, in the linear case, we resort to out-of-core storage (we write the matrix 
$K^{mn}$ to disk once it has been computed). By this we obtain a preliminary solution 
using a subset consisting of only n basis functions. 
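
The low-rank solve can be sketched in the same way. This is a minimal illustration, assuming NumPy, a Gaussian RBF kernel, and the first n training points as basis functions; the names `rbf_kernel` and `lowrank_solve` are hypothetical, and the basis selection is an arbitrary choice rather than the method of Section 10.2.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lowrank_solve(K_mn, K_nn, d, rhs):
    """Solve (K_mn K_nn^{-1} K_mn^T + diag(d)) v = rhs with an n x n system.

    K_mn : (m, n) kernel matrix between all patterns and the n basis patterns
    K_nn : (n, n) kernel matrix of the basis patterns
    d    : (m,) strictly positive diagonal entries
    """
    Dinv_rhs = rhs / d
    Dinv_Kmn = K_mn / d[:, None]
    small = K_nn + K_mn.T @ Dinv_Kmn       # n x n matrix, built in O(m n^2)
    return Dinv_rhs - Dinv_Kmn @ np.linalg.solve(small, K_mn.T @ Dinv_rhs)
```

Only $K^{mn}$ and $K^{nn}$ are stored, so the memory footprint is O(mn) instead of O(m^2).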


Iterative Improvement If further precision is required we need to decrease the 
size of the problem by eliminating patterns for which the corresponding value 
of $\alpha$ can be reliably predicted. We can do this by dropping those patterns which 
have the largest distance from the boundary (either with $\alpha_i = 0$ or $\alpha_i$ at the upper 
constraint). These patterns are equivalent to those with the largest corresponding 
dual Lagrange multipliers, since these are least likely to change their value when 
we increase the precision of our method (see Section 10.3.2). The reduced size m of 
the training set then allows us to store matrices with larger n and to continue the 
optimization on the smaller dataset. This is done until no further minimization 
of the regularized risk functional $R_{\mathrm{reg}}[f]$ is observed, or until the computational 
restrictions of the user are met (the maximum number of basis functions we are 
willing to use for the solution of the optimization problem is attained). 


Parallelization Note that this formulation lends itself to a parallel implementation 
of a SV training algorithm since the matrix multiplications can readily be imple- 
mented on a cluster of workstations by using matrix manipulation libraries such 
as SCALAPACK [144] or BLACS [146]. 


Stopping Rule Finally, if desired, we could also use the value of the primal objective 
function (10.43), namely $\frac{1}{2}\alpha^\top Q\alpha + c^\top\alpha$, with $K$ rather than $\tilde{K}$ as an upper 
bound for the minimum of (10.43). This is due to Theorem 10.2 which states that 
$\alpha^\top \tilde{K}\alpha \le \alpha^\top K\alpha$. We may not always want to use this bound, since it requires computation 
of $K$ which may be too costly (but given that it is provided only as a 
performance guarantee this may not be a major concern). 


10.4 Subset Selection Methods 


Training on SVs 


In many applications the precision of the solution (in terms of the Lagrange 
multipliers $\alpha_i$) is not a prime objective. In other situations we can reasonably 
assume that a large number of patterns will either become SVs with $\alpha_i$ constrained 
to the boundary, or will not become SVs at all. In any of these cases we may 
break the optimization problem up into small manageable subproblems and solve 
these iteratively. This technique is commonly referred to as chunking, or subset 
selection. 


10.4.1 Chunking 


The simplest chunking idea, introduced in [559], relies on the observation that 
only the SVs contribute to the final form of the hypothesis. In other words — if we 
were given only the SVs, we would obtain exactly the same final hypothesis as if 
we had the full training set at our disposal (see Section 7.3). Hence, knowing the 
SV set beforehand and, further, being able to fit it (and the dot product matrix) 
into memory, one could directly solve the reduced problem and thereby deal with 
significantly larger datasets. 

The catch is that we do not know the SV set before solving the problem. The 
heuristic is to start with an arbitrary subset, a first chunk that fits into memory. We 
then train the SV algorithm on it, keep the SVs while discarding the non-SV data 
in the chunk, and replace the discarded data with data on which the current estimator would make 
errors (for instance, data lying outside the $\varepsilon$-tube of the current regression). The 
system is then retrained and we keep on iterating until the KKT conditions are 
satisfied for all samples. 


10.4.2 Working Set Algorithms 


The problem with chunking is that the strategy will break down on datasets where 
the dot-product matrix built from SVs cannot be suitably kept in memory. A 
possible solution to this dilemma was given in [398]. The idea is to have only a 
subset of the variables as a working set, and optimize the problem with respect 
to them while freezing the other variables. In other words, to perform coordinate 
descent on subsets of variables. This method is described in detail in [398, 266] 
for the case of pattern recognition. Further information can be found in [459], for 
example. 

In the following we describe the concept behind optimization problems of type 
(10.43). Subsequently we will adapt it to regression and classification. Let us 
assume that there exists a subset $S_w \subset [m]$, also referred to as the working set, which 
is an index set determining the variables $\alpha_i$ we will be using for optimization 
purposes. Next denote by $S_f = [m] \setminus S_w$ the fixed set of variables which will not 
be modified in the current iteration. Finally, we split up $Q$, $c$, $A$, and $u$ accordingly 
into 


$$Q = \begin{bmatrix} Q_{ww} & Q_{wf} \\ Q_{fw} & Q_{ff} \end{bmatrix} \tag{10.56}$$


and $c = (c_w, c_f)$, $A = [A_w \; A_f]$, $u = (u_w, u_f)$. In this case (10.43) reads as 


$$\begin{aligned} \underset{\alpha_w}{\text{minimize}} \quad & \tfrac{1}{2}\alpha_w^\top Q_{ww}\alpha_w + [c_w + Q_{wf}\alpha_f]^\top \alpha_w + \left[\tfrac{1}{2}\alpha_f^\top Q_{ff}\alpha_f + c_f^\top \alpha_f\right] \\ \text{subject to} \quad & A_w\alpha_w = d - A_f\alpha_f \\ & \{0 \le \alpha_w \le u_w\} \text{ or } \{\alpha_w + t_w = u_w \text{ and } \alpha_w, t_w \ge 0\} \end{aligned} \tag{10.57}$$


This means that we recover a standard convex program in a subset of variables, 
with the only difference being that the linear constraints and the linear contribution 
have to be changed due to their dependency on $\alpha_f$. For the sake of completeness 
we keep the constant offset produced by $\alpha_f$ but it need not be taken into 
account during the actual optimization process. 

Minimizing (10.57) with respect to $\alpha_w$ will also decrease (10.43). In particular, 
the amount of progress on the subset is identical to the progress on the whole⁷. 
For these subsets we may use any optimization algorithm we prefer; interior point 
codes for example. Algorithm 10.2 contains the details. The main difficulty in 
implementing subset selection strategies is how to choose $S_w$ and $S_f$. We will 
address this issue in the next section. 
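
The bookkeeping behind the reduced problem can be sketched as follows. This is a minimal illustration, assuming NumPy, a symmetric Q, and a working set given as an integer index array; the function name `reduced_problem` and its return layout are hypothetical.

```python
import numpy as np

def reduced_problem(Q, c, A, d, alpha, work):
    """Return the data of the subproblem on the working set `work`.

    Q, c, A, d : data of the full problem (Q symmetric, A alpha = d)
    alpha      : current feasible point of the full problem
    work       : integer array with the indices in the working set
    Returns (Q_ww, linear term, constant offset, A_w, right-hand side).
    """
    m = Q.shape[0]
    fixed = np.setdiff1d(np.arange(m), work)
    a_f = alpha[fixed]
    Q_ww = Q[np.ix_(work, work)]
    Q_wf = Q[np.ix_(work, fixed)]
    lin = c[work] + Q_wf @ a_f                       # c_w + Q_wf alpha_f
    # Constant contribution of the frozen variables; not needed for optimization.
    offset = 0.5 * a_f @ Q[np.ix_(fixed, fixed)] @ a_f + c[fixed] @ a_f
    rhs = d - A[:, fixed] @ a_f                      # d - A_f alpha_f
    return Q_ww, lin, offset, A[:, work], rhs
```

Because the constant offset is kept, the reduced objective evaluated at the current working-set variables equals the full objective, so any decrease on the subset is a decrease of the same size on the whole problem.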

Under certain technical assumptions (see [289, 96]) Algorithm 10.2 can be shown 
to converge to a global minimum. Several variants of such working set algorithms 


7. Such a decrease will always occur when the optimality conditions in (10.57) are not 
satisfied and, of course, we choose working sets with this property. 

Algorithm 10.2 Subset Selection 
input kernel k, data X, precision $\varepsilon$ 
Initialize $\alpha_i, \alpha_i^* = 0$ 
repeat 
    Compute coupling terms $Q_{wf}\alpha_f$ and $A_f\alpha_f$ for $S_w$. 
    Solve reduced optimization problem on $S_w$. 
    Choose new $S_w$ from variables $\alpha_i, \alpha_i^*$ not satisfying the KKT conditions. 
    Compute bound on the Error in computing the minimum (e.g. by the KKT Gap) 
until Error < $\varepsilon$ or $S_w = \emptyset$ 


have been proposed [284, 266, 398, 459, 108], most of them with slightly different 
selection strategies. In the following we focus on the common features among 
them and explain the basic ideas. For practical implementations we recommend 
the reader look for the most recent information since seemingly minor details in 
the selection strategy appear to have a significant impact on the performance. It is 
important to also be aware that the performance of these heuristics often depends 
on the dataset. 


10.4.3 Selection Strategies 


Recall Proposition 10.1, and in particular (10.6) and (10.8). For classification, the 
terms ôg c(max(0,1 — yjf(x;))) and aj(yjf(xj) — 1) give an upper bound on the 
deviation between the current solution and the optimal one. A similar relation 
applies for regression. 

The central idea behind most selection strategies is to pick those patterns whose 
contribution to the size of the KKT Gap (10.13), (10.18) is largest. This is a reasonable 
strategy since, after optimization over the working set $S_w$, the corresponding 
terms of the KKT Gap will have vanished (see Problem 10.11). Unfortunately 
though this does not guarantee that the overall size of the KKT gap will diminish 
by the same amount, since the other terms $\mathrm{KKT}_i$ for $i \in S_f$ may increase. Still, we 
will have picked the set of variables for which the KKT gap size for the restricted 
optimization problem on $S_w$ is largest. Note that in this context we also have to 
take the constraint $A\alpha = d$ into account. This means that we have to select the 
working set $S_w$ such that 


$$\Omega_w = \{\alpha_w \mid A_w\alpha_w = d - A_f\alpha_f\} \cap \{\alpha_w \mid 0 \le \alpha_w \le C\} \tag{10.58}$$


is sufficiently large or, at least, does not only contain one element (see Figure 10.5 
for the case of standard SV classification). 

As before we focus on classification (the regression case is completely analo- 
gous), and in particular on soft margin loss functions (10.6). 


KKT Selection Criterion Since $\xi = \max(\xi, 0) - \max(0, -\xi)$ we can rewrite $\mathrm{KKT}_i$ 
as 


$$\mathrm{KKT}_i = (C - \alpha_i)\max(0, 1 - y_i f(x_i)) + \alpha_i\max(0, y_i f(x_i) - 1). \tag{10.59}$$


The size of $\mathrm{KKT}_i$ is used to select which variables to use for optimization purposes 
in the RHUL SV package [459]. 


Figure 10.5 Constraints on the working set $S_w$ in the case of classification. In all cases the 
box constraints $0 \le \alpha_i \le C$ apply. Left: two patterns of different classes result in $(+1)\alpha_1 + 
(-1)\alpha_2 = c$. Middle: the same case with $c = C$, thus the only feasible point is $(C, 0)$. Right: 
two patterns of the same class with $\alpha_1 + \alpha_2 \le C$. 


Dual Constraint Satisfaction Other algorithms, such as SVMLight [266] or SVMTorch 
[108], see also [502], choose those patterns $i$ in the working set for which 


$$\mathrm{KKT}_i = H(C - \alpha_i)\max(0, 1 - y_i f(x_i)) + H(\alpha_i)\max(0, y_i f(x_i) - 1) \tag{10.60}$$


is maximal⁸, where $H$ denotes the Heaviside function; 


$$H(\xi) = \begin{cases} 1 & \text{if } \xi > 0 \\ 0 & \text{otherwise} \end{cases} \tag{10.61}$$


As one can see, $C$ times the quantity in (10.60) is clearly an upper bound on $\mathrm{KKT}_i$ 
in (10.59), and thus, if in the process of the optimization the sum over $i$ of the terms 
(10.60) vanishes, we can automatically ensure that 
the KKT gap will also decrease. The only problem with (10.60) is that it may be 
overly conservative in the stopping criterion and overestimate the influence of the 
constraint on $f(x_i)$, when the corresponding constraint on $\alpha_i$ is almost met. 

A quite similar (albeit less popular) selection rule to (10.60) can be obtained from 
considering the gradient of the objective function of the optimization problem 
(10.43), namely $Q\alpha + c$. In this case we want to search in a direction $d$ where 
$d^\top(Q\alpha + c)$ is large. 

8. This may not be immediately obvious from the derivations in [266] and [108], since their 
reasoning is based on the idea of [618] that we should select (feasible) directions where 
the gradient of the objective function is maximal. An explicit calculation of the gradient, 
however, reveals that both strategies lead to the same choice of a working set. 


Table 10.1 Subset selection criteria for classification. 

KKT selection criterion (10.59)          $(C - \alpha_i)\max(0, 1 - y_i f(x_i)) + \alpha_i\max(0, y_i f(x_i) - 1)$ 
Dual constraint satisfaction (10.60)     $H(C - \alpha_i)\max(0, 1 - y_i f(x_i)) + H(\alpha_i)\max(0, y_i f(x_i) - 1)$ 
Primal constraint satisfaction (10.62)   $(C - \alpha_i)H(1 - y_i f(x_i)) + \alpha_i H(y_i f(x_i) - 1)$ 


Primal Constraint Satisfaction Equations (10.59) and (10.60) suggest a third selection 
criterion, this time based only on the behavior of the Lagrange multipliers 
$\alpha_i$. Rather than applying the Heaviside function to them we could also apply it to 
the factors depending on $f(x_i)$ directly. Since $H(\max(0, \xi)) = H(\xi)$ we obtain 

$$\mathrm{KKT}_i = (C - \alpha_i)H(1 - y_i f(x_i)) + \alpha_i H(y_i f(x_i) - 1). \tag{10.62}$$


In other words, only those patterns where the KKT conditions are violated and 
the Lagrange multipliers $\alpha_i$ differ from their optimal values by a large amount are 
chosen. 


Summarizing, we may either select patterns (i) based on their contribution to the 
size of the KKT gap, (ii) based on the size of the gradient of the objective function 
at the particular location, or (iii) depending on the mismatch of the Lagrange 
multipliers. Table 10.1 summarizes the three strategies. Overall, strategy (i) is 
preferred, since it uses the tightest of all three bounds on the minimum of the 
objective function. 
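
The three criteria differ only in where the Heaviside function is applied, which a few lines of code make explicit. This is a minimal sketch in plain Python for a single pattern; the function names are hypothetical, `margin` stands for $y_i f(x_i)$, and a single constant C is assumed for all patterns.

```python
def heaviside(x):
    """H(x) = 1 for x > 0 and 0 otherwise, as in (10.61)."""
    return 1.0 if x > 0 else 0.0

def selection_scores(alpha, margin, C):
    """Scores (10.59), (10.60), (10.62) for one pattern with margin = y_i f(x_i)."""
    below = max(0.0, 1.0 - margin)   # amount by which the margin is violated
    above = max(0.0, margin - 1.0)   # amount by which the margin is exceeded
    kkt_gap = (C - alpha) * below + alpha * above
    dual = heaviside(C - alpha) * below + heaviside(alpha) * above
    primal = (C - alpha) * heaviside(1.0 - margin) + alpha * heaviside(margin - 1.0)
    return kkt_gap, dual, primal
```

For example, a pattern with alpha = 0.2, margin 0.5 and C = 1 gets the scores (0.4, 0.5, 0.8), while a pattern with alpha = 0 lying outside the margin scores zero under all three rules and is never selected.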


Feasible Directions In all cases we need to select a subset such that both the 
constraints $A\alpha = d$ and the box constraints on $\alpha$ are enforced. A direction $e$ with 
$Ae = 0$ and $0 \le \alpha + \delta e \le C$ for some $\delta > 0$, that is a direction taking only points into 
account where the KKT conditions are violated, will surely do. In order to keep 
memory requirements low we will only choose a small subset $S_w \subset \{1, \ldots, m\}$ of 
size typically less than 100. 


Balanced Sample To obtain a relatively large search space while satisfying $A\alpha = d$, 
we keep only coordinates $d_i$ (and corresponding $\alpha_i$) at which the gradient points 
away from the (possibly active) boundary constraints. Since $A$, in general, has a 
rather simple form (only entries 1 and -1) this requirement can be met by selecting 
a direction $e$ where the signs of the corresponding entries in $A$ alternate. 

This can mean, for example for classification, that an equal number of positive and 
negative samples needs to be selected. For $\nu$-SVM we additionally have to balance between 
points on either side of the margin within each class. The case of regression 
is analogous, with the only difference being that we have two “margins” rather 
than one. 


Summing up, a simple (and very much recommended) heuristic for subset selection 
is to pick directions where the gradient $Q\alpha + c$ is largest, the KKT conditions 
for the corresponding variables are violated, and where the number of samples of 
either class and relation to the margin is balanced. This is what is done in [266, 108]. 
A proof of convergence can be found in [330]. One can show that the patterns selected 
by the gradient rule are identical to those chosen by (10.60) according to $\mathrm{KKT}_i$ 
(see Problem 10.13). 


10.5 Sequential Minimal Optimization 


One algorithm, Sequential Minimal Optimization (SMO), introduced by Platt 
[409], puts chunking to the extreme by iteratively selecting subsets of size 2 and 
optimizing the target function with respect to them. It has been reported to be several 
orders of magnitude faster on certain data sets (up to a factor of 1000) and 
to exhibit better scaling properties (typically up to one order better) than classical 
chunking (Section 10.4.1). The key point is that for a working set of 2 the optimization 
subproblem can be solved analytically without explicitly invoking a quadratic 
optimizer⁹. 

SMO is one of the most easily implementable algorithms and it has a very be- 
nign memory footprint, essentially only of the size of the number of samples. In 
Section 8.4, we considered the special case of single-class problems; we now de- 
velop the classification and regression cases. This development includes a treat- 
ment of pattern dependent regularization and details how the algorithm can be 
extended to more general convex loss functions. 

The exposition proceeds as follows; first we solve the generic optimization 
problem in two variables (Section 10.5.1) and subsequently we determine the value 
of the placeholders of the generic problem in the special cases of classification and 
regression. Finally we discuss how to adjust b properly and we determine how 
patterns should be selected to ensure speedy convergence (Section 10.5.5). 


10.5.1 Analytic Solutions 


We begin with a generic convex constrained optimization problem in two variables 
(for regression we actually have to consider four variables, $\alpha_i, \alpha_i^*, \alpha_j, \alpha_j^*$; 
however, only two of them may be nonzero simultaneously). By analogy to (10.43) 
and (10.57) we have 


$$\begin{aligned} \underset{\alpha_i, \alpha_j}{\text{minimize}} \quad & \tfrac{1}{2}\left[\alpha_i^2 Q_{ii} + \alpha_j^2 Q_{jj} + 2\alpha_i\alpha_j Q_{ij}\right] + c_i\alpha_i + c_j\alpha_j \\ \text{subject to} \quad & s\alpha_i + \alpha_j = \gamma \\ & 0 \le \alpha_i \le C_i \text{ and } 0 \le \alpha_j \le C_j. \end{aligned} \tag{10.63}$$


Here $c_i, c_j, \gamma \in \mathbb{R}$, $s \in \{\pm 1\}$, and $Q \in \mathbb{R}^{2 \times 2}$ are chosen suitably to take the effect 
of the $m - 2$ variables that are kept fixed into account. The constants $C_i$ represent 
pattern dependent regularization parameters (as proposed in [331], among others, 
for unbalanced observations). Recall that we consider only optimization problems 
with one equality constraint. The following auxiliary lemma states the solution of 
(10.63). 


9. Note that in the following we will only consider standard SV classification and regression, 
since most other settings (an exception being the single-class algorithm described in 
Section 8.4) have more than one equality constraint and would require at least (the number 
of equality constraints + 1) variables per iteration in order to make any progress. In such a 
case (a) the difficulty of selecting a suitable set of directions would increase significantly and 
(b) the computational cost incurred by performing an update in a one-dimensional space 
would increase linearly with the number of constraints rendering SMO less attractive. See 
also Problem 10.14. 


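Lemma 10.3 (Analytic Solution of Constrained Optimization) Assume we have an 
optimization problem of type (10.63). Further, denote by 


$$\zeta := s c_j - c_i + \gamma s Q_{jj} - \gamma Q_{ij} \tag{10.64}$$
$$\chi := Q_{ii} + Q_{jj} - 2 s Q_{ij} \tag{10.65}$$

two auxiliary variables derived from (10.63). Then, for $\chi = 0$ we have 

$$\alpha_i = \begin{cases} L & \text{if } \zeta < 0 \\ H & \text{otherwise} \end{cases} \tag{10.66}$$

and for $\chi > 0$ we obtain $\alpha_i = \min(\max(L, \chi^{-1}\zeta), H)$. The case of $\chi < 0$ never occurs. 
Furthermore $\alpha_j = \gamma - s\alpha_i$, and $L$, $H$ are defined as 

$$L := \begin{cases} \max(0, s^{-1}(\gamma - C_j)) & \text{if } s > 0 \\ \max(0, s^{-1}\gamma) & \text{otherwise} \end{cases} \tag{10.67}$$
$$H := \begin{cases} \min(C_i, s^{-1}\gamma) & \text{if } s > 0 \\ \min(C_i, s^{-1}(\gamma - C_j)) & \text{otherwise} \end{cases} \tag{10.68}$$


Proof The idea is to remove $\alpha_j$ and the corresponding constraint from the optimization 
problem and to solve for $\alpha_i$. We begin with constraints on $\alpha_i$ and the 
connection between $\alpha_i$ and $\alpha_j$. Due to the equality constraint in (10.63) we have 

$$\alpha_j = \gamma - s\alpha_i \tag{10.69}$$

and additionally, due to the constraints on $\alpha_j$, 

$$s\alpha_i = \gamma - \alpha_j \text{ and thus } \gamma \ge s\alpha_i \ge \gamma - C_j. \tag{10.70}$$

Since $C_i \ge \alpha_i \ge 0$ we may combine the two constraints into the constraint $H \ge \alpha_i \ge L$ 
where $L$ and $H$ are given by (10.67) and (10.68). Now that we have determined 
the constraints we look for the minimum. Elimination of $\alpha_j = \gamma - s\alpha_i$ yields 

$$\begin{aligned} \underset{\alpha_i}{\text{minimize}} \quad & \tfrac{1}{2}\alpha_i^2(Q_{ii} + Q_{jj} - 2sQ_{ij}) + \alpha_i(c_i - s c_j + \gamma Q_{ij} - \gamma s Q_{jj}) \\ \text{subject to} \quad & L \le \alpha_i \le H. \end{aligned} \tag{10.71}$$

We have ignored constant terms independent of $\alpha_i$ since they do not influence the 
location of the minimum. The unconstrained objective function, which can also 
be written as $\frac{\chi}{2}\alpha_i^2 - \zeta\alpha_i$, has its minimum at $\chi^{-1}\zeta$. In order to ensure that the 
solution is also optimal for the constraint $\alpha_i \in [L, H]$ we only have to “clip” the 
unconstrained solution $\chi^{-1}\zeta$ to the interval, i.e. $\alpha_i = \min(\max(\chi^{-1}\zeta, L), H)$. For 
$\chi = 0$ the objective reduces to the linear function $-\zeta\alpha_i$, which is minimized at $H$ 
if $\zeta > 0$ and at $L$ if $\zeta < 0$. This concludes the proof. ∎ 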




During the optimization process it may happen, due to numerical instabilities, that 
the numerical value of $\chi$ is negative. In this situation we simply reset $\chi = 0$ and use 
(10.66). All that now remains is to find explicit values in the cases of classification 
and regression. 
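
The analytic solution of the two-variable problem (10.63) fits in a few lines. The following is a minimal sketch in plain Python; the function name and argument order are hypothetical. For a vanishing (or numerically negative) quadratic coefficient it moves to whichever end of the feasible interval makes the remaining linear objective smaller.

```python
def solve_two_variable(Q_ii, Q_jj, Q_ij, c_i, c_j, s, gamma, C_i, C_j):
    """Minimize (10.63) analytically and return (alpha_i, alpha_j).

    s is +1 or -1, gamma is the right-hand side of s*alpha_i + alpha_j = gamma,
    and C_i, C_j are the upper box bounds.
    """
    # Feasible interval [L, H] for alpha_i, from 0 <= gamma - s*alpha_i <= C_j.
    if s > 0:
        L, H = max(0.0, gamma - C_j), min(C_i, gamma)
    else:
        L, H = max(0.0, -gamma), min(C_i, C_j - gamma)
    zeta = s * c_j - c_i + gamma * s * Q_jj - gamma * Q_ij
    chi = Q_ii + Q_jj - 2 * s * Q_ij
    if chi > 0:
        # Clip the unconstrained minimizer zeta / chi to [L, H].
        alpha_i = min(max(zeta / chi, L), H)
    else:
        # Objective is -zeta * alpha_i (chi treated as 0): pick the better end.
        alpha_i = H if zeta > 0 else L
    return alpha_i, gamma - s * alpha_i
```

The second return value follows from the equality constraint, so the pair is feasible whenever the interval [L, H] is nonempty.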


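10.5.2 Classification 

Proposition 10.4 (Optimal Values for Classification) In classification the optimal 
values of $\alpha_i$ and $\alpha_j$ are given as follows. Denote by $\chi = K_{ii} + K_{jj} - 2K_{ij}$, $s = y_i y_j$, 
and let $L$, $H$ be defined as in 


$$\begin{array}{c|cc} & y_i = y_j & y_i \ne y_j \\ \hline L & \max(0, \alpha_i^{old} + \alpha_j^{old} - C_j) & \max(0, \alpha_i^{old} - \alpha_j^{old}) \\ H & \min(C_i, \alpha_i^{old} + \alpha_j^{old}) & \min(C_i, C_j + \alpha_i^{old} - \alpha_j^{old}) \end{array}$$


We have $\alpha_i = \min(\max(\bar{\alpha}, L), H)$ and $\alpha_j = s(\alpha_i^{old} - \alpha_i) + \alpha_j^{old}$. With the auxiliary 
definition $\delta := y_i((f(x_j) - y_j) - (f(x_i) - y_i))$, $\bar{\alpha}$ is given by 


$$\bar{\alpha} = \begin{cases} \alpha_i^{old} + \chi^{-1}\delta & \text{if } \chi > 0 \\ \infty & \text{if } \chi = 0 \text{ and } \delta > 0 \\ -\infty & \text{if } \chi = 0 \text{ and } \delta < 0 \end{cases} \tag{10.72}$$


(the last two cases are the limits of the first as $\chi \to 0$). 
This means that the change in $\alpha_i$ and $\alpha_j$ depends on the difference between the 
approximation errors in $i$ and $j$. Moreover, in the case that the unconstrained solution 
of the problem is identical to the constrained one ($\alpha_i = \bar{\alpha}$) the improvement in 
the objective function is given by $\frac{1}{2}\chi^{-1}((f(x_j) - y_j) - (f(x_i) - y_i))^2$. Hence we should 
attempt to find pairs of patterns $(i, j)$ where the difference in the classification errors 
is largest (and the constraints will still allow improvements in terms of the 
Lagrange multipliers). 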


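Proof We begin with $\gamma$. In classification we have $\sum_i y_i\alpha_i = 0$ and thus 
$y_i\alpha_i + y_j\alpha_j = y_i\alpha_i^{old} + y_j\alpha_j^{old}$, or equivalently 

$$y_i y_j\alpha_i + \alpha_j = y_i y_j\alpha_i^{old} + \alpha_j^{old} =: \gamma \text{ and } s = y_i y_j. \tag{10.73}$$

Now we turn our attention to $Q$. From (10.56) we conclude that $Q_{ii} = K_{ii}$, $Q_{jj} = K_{jj}$ 
and $Q_{ij} = Q_{ji} = sK_{ij}$. This leads to 

$$\chi = K_{ii} + K_{jj} - 2K_{ij}. \tag{10.74}$$

Next we compute $c_i$ and $c_j$. Eq. (10.57) leads to 

$$c_i = -1 + y_i\sum_{l \ne i,j}\alpha_l y_l K_{il} = y_i(f(x_i) - b - y_i) - \alpha_i K_{ii} - \alpha_j s K_{ij} \tag{10.75}$$

and similarly for $c_j$. Using $y_i = y_j s$ we compute $\zeta$ as 

$$\begin{aligned} \zeta = {} & -y_i(f(x_i) - b - y_i) + \alpha_i K_{ii} + \alpha_j s K_{ij} + y_i(f(x_j) - b - y_j) \\ & - \alpha_j s K_{jj} - \alpha_i K_{ij} + (\alpha_i + s\alpha_j)(K_{jj} - K_{ij}) \end{aligned} \tag{10.76}$$
$$\phantom{\zeta} = y_i((f(x_j) - y_j) - (f(x_i) - y_i)) + \alpha_i\chi. \tag{10.77}$$

Substituting the values of $\gamma$, $\chi$, and $\zeta$ into Lemma 10.3 concludes the proof. ∎ 


Implementation 
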
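
The update of Proposition 10.4 for a single pair can be sketched as follows. This is a minimal illustration in plain Python, assuming the current function values f[k] = f(x_k), labels y[k] in {+1, -1}, a kernel matrix indexable as K[i][j], and per-pattern bounds C[k]; the function name is hypothetical. When the quadratic coefficient vanishes the sketch moves to the end of [L, H] at which the linear objective is smaller, that is to H when delta is positive.

```python
def smo_pair_update(i, j, alpha, y, f, K, C):
    """One analytic SMO step for classification; returns new (alpha_i, alpha_j)."""
    s = y[i] * y[j]
    if s > 0:
        L = max(0.0, alpha[i] + alpha[j] - C[j])
        H = min(C[i], alpha[i] + alpha[j])
    else:
        L = max(0.0, alpha[i] - alpha[j])
        H = min(C[i], C[j] + alpha[i] - alpha[j])
    chi = K[i][i] + K[j][j] - 2.0 * K[i][j]
    # Difference of the errors f(x) - y at the two patterns, signed by y_i.
    delta = y[i] * ((f[j] - y[j]) - (f[i] - y[i]))
    if chi > 0:
        a_bar = alpha[i] + delta / chi
    else:
        a_bar = float("inf") if delta > 0 else float("-inf")
    new_i = min(max(a_bar, L), H)              # clip to the feasible interval
    new_j = alpha[j] + s * (alpha[i] - new_i)  # keeps sum_k y_k alpha_k unchanged
    return new_i, new_j
```

For two patterns x = 1 and x = -1 with labels +1 and -1, a linear kernel, and all multipliers starting at zero, a single step yields alpha = (0.5, 0.5), which corresponds to w = 1 and a margin of exactly 1 at both points.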
10.5.3 Regression 


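We proceed as in classification. We have the additional difficulty, however, that for 
each pair of patterns $x_i$ and $x_j$ we have four Lagrange multipliers $\alpha_i, \alpha_i^*, \alpha_j$ and 
$\alpha_j^*$. Hence we must possibly consider up to three different pairs of solutions¹⁰. Let 
us rewrite the restricted optimization problem in regression as follows 


$$\begin{aligned} \underset{\alpha_i, \alpha_i^*, \alpha_j, \alpha_j^*}{\text{minimize}} \quad & \frac{1}{2}\begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix}^\top \begin{bmatrix} K_{ii} & K_{ij} \\ K_{ij} & K_{jj} \end{bmatrix} \begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix} + \begin{bmatrix} c_i \\ c_j \end{bmatrix}^\top \begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix} + \varepsilon(\alpha_i + \alpha_i^* + \alpha_j + \alpha_j^*) \\ \text{subject to} \quad & 0 \le \alpha_l \le C_l \text{ and } 0 \le \alpha_l^* \le C_l^* \text{ for all } l \in \{i, j\} \\ & (\alpha_i - \alpha_i^*) + (\alpha_j - \alpha_j^*) = (\alpha_i^{old} - \alpha_i^{*old}) + (\alpha_j^{old} - \alpha_j^{*old}) =: \gamma. \end{aligned} \tag{10.78}$$

Here $c_i, c_j$ are suitably chosen constants depending solely on the differences $\alpha_i - 
\alpha_i^*$ and $\alpha_j - \alpha_j^*$. One can check that 


$$c_i = -y_i + f(x_i) - b - K_{ii}(\alpha_i - \alpha_i^*) - K_{ij}(\alpha_j - \alpha_j^*) \tag{10.79}$$


and $c_j$ likewise. We deliberately keep the contribution due to $\varepsilon$ separate since this 
is the only part where sums $\alpha_i + \alpha_i^*$ rather than differences enter the equations. 

As in the classification case we begin with the constraints on $\alpha_i$ and $\alpha_i^*$. Due 
to the summation constraint in the optimization problem we obtain $(\alpha_i - \alpha_i^*) = 
\gamma - (\alpha_j - \alpha_j^*)$. Using the additional box constraints on $\alpha_l, \alpha_l^*$ of (10.78) leads to 


$$L := \max(\gamma - C_j, -C_i^*) \le \alpha_i - \alpha_i^* \le \min(\gamma + C_j^*, C_i) =: H. \tag{10.80}$$

This allows us to eliminate $\alpha_j, \alpha_j^*$ and rewrite (10.78) in terms of $\beta := \alpha_i - \alpha_i^*$, 

$$\begin{aligned} \underset{\beta}{\text{minimize}} \quad & \tfrac{1}{2}\beta^2(K_{ii} + K_{jj} - 2K_{ij}) + \beta\gamma(K_{ij} - K_{jj}) + \beta(c_i - c_j) + \varepsilon(|\beta| + |\gamma - \beta|) \\ & = \tfrac{1}{2}\beta^2\chi + \beta\left((f(x_i) - y_i) - (f(x_j) - y_j) - \chi(\alpha_i^{old} - \alpha_i^{*old})\right) + \varepsilon(|\beta| + |\gamma - \beta|) \\ \text{subject to} \quad & L \le \beta \le H. \end{aligned} \tag{10.81}$$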


10. The number of solutions is restricted to four due to the restriction that $\alpha_i$ and $\alpha_i^*$ (or 
analogously $\alpha_j$ and $\alpha_j^*$) may never both be nonzero at the same time. In addition, the 
constraint that $\alpha_i - \alpha_i^* + \alpha_j - \alpha_j^* = \gamma$ rules out one of these remaining four combinations. 


[Figure: plot titled "Convex, piecewise quadratic function on [-2, 1.5]".] 

Figure 10.6 The minimum of this function occurs at $\beta = 0$ due to the change in $\varepsilon|\beta|$. 


Here we used $\chi = K_{ii} + K_{jj} - 2K_{ij}$, as in classification. The objective function 
is convex and piecewise quadratic on the three intervals 


$$I_- := [L, \min(0, \gamma)], \quad I_0 := [\min(0, \gamma), \max(0, \gamma)], \quad I_+ := [\max(0, \gamma), H] \tag{10.82}$$


(for $\gamma = 0$ the interval $I_0$ vanishes). An example of such a function is given in Figure 
10.6. One can check that the unconstrained minimum of the quadratic objective 
function (10.81), as defined on the intervals $I_-, I_0, I_+$, would be given by 


$$\left.\begin{matrix} \beta_- \\ \beta_0 \\ \beta_+ \end{matrix}\right\} := \alpha_i^{old} - \alpha_i^{*old} + \chi^{-1}\left((f(x_j) - y_j) - (f(x_i) - y_i)\right) + \chi^{-1}\varepsilon \cdot \begin{cases} 2 & \text{if } \beta \in I_- \\ 0 & \text{if } \beta \in I_0 \\ -2 & \text{if } \beta \in I_+ \end{cases} \tag{10.83}$$


For $\chi = 0$ the same considerations as in classification apply; the optimum is found 
on one of the interval boundaries. Furthermore, since (10.78) is convex all we now 
have to do is match up the solutions $\beta_-, \beta_0, \beta_+$ with the corresponding intervals. 

For convenience we start with $\beta_0$ and $I_0$. If $\beta_0 \in I_0$ we have found the optimum. 
Otherwise we must continue our search in the direction in which $\beta_0$ exceeds $I_0$. 
Without loss of generality assume that this is $I_+$. Again, if $\beta_+ \in I_+$ we may stop. 
Otherwise we simply “clip” $\beta_+$ to the interval boundaries of $I_+$. Now we have to 
reconstruct $\alpha$ from $\beta$. Due to the box constraints and the fact that $\sum_i(\alpha_i - \alpha_i^*) = 0$ 
we obtain 


$$\begin{aligned} \alpha_i &= \max(0, \beta), & \alpha_j &= \max(0, \gamma - \beta), \\ \alpha_i^* &= \max(0, -\beta), & \alpha_j^* &= \max(0, -\gamma + \beta). \end{aligned} \tag{10.84}$$
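
The search over the three intervals can be sketched compactly. The following plain-Python illustration assumes the objective is written as 0.5*chi*beta^2 + lin*beta + eps*(|beta| + |gamma - beta|), where `lin` stands for the linear coefficient of (10.81); the function names are hypothetical. Instead of following the search order described above, the sketch evaluates the clipped minimizer of every piece and keeps the best one, which gives the same result because the objective is convex.

```python
def minimize_beta(chi, lin, eps, gamma, L, H):
    """Minimize 0.5*chi*b^2 + lin*b + eps*(|b| + |gamma - b|) over [L, H]."""
    def g(beta):
        return 0.5 * chi * beta * beta + lin * beta + eps * (abs(beta) + abs(gamma - beta))

    lo, hi = min(0.0, gamma), max(0.0, gamma)
    # Each piece: (left end, right end, slope contributed by the eps-term).
    pieces = [(L, lo, -2.0 * eps), (lo, hi, 0.0), (hi, H, 2.0 * eps)]
    candidates = []
    for left, right, slope in pieces:
        left, right = max(left, L), min(right, H)
        if left > right:
            continue                      # piece lies outside [L, H]
        if chi > 0:
            stationary = -(lin + slope) / chi
            candidates.append(min(max(stationary, left), right))
        else:
            candidates.extend([left, right])   # linear piece: optimum at an end
    return min(candidates, key=g)

def alphas_from_beta(beta, gamma):
    """Recover (alpha_i, alpha_i*, alpha_j, alpha_j*) from beta as in (10.84)."""
    return (max(0.0, beta), max(0.0, -beta),
            max(0.0, gamma - beta), max(0.0, beta - gamma))
```

With chi = 1, lin = 0.3, eps = 0.5, gamma = 0.8 on [-2, 1.5] the minimizer is beta = 0: the kink of the eps-term outweighs the slope of the quadratic part on both sides.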


In order to arrive at a complete SV regression or classification algorithm, we still 
need a way of selecting the patterns $x_i, x_j$ and a method specifying how to update 
the constant offset $b$ efficiently. Since most pattern selection methods use $b$ as 
additional information to select patterns we will start with $b$. 


10.5.4 Computing the Offset b and Optimality Criteria 


We can compute b by exploiting the KKT conditions (see Theorem 6.21). For 
instance in classification; at the solution, the margin must be exactly 1 for Lagrange 
multipliers for which the box constraints are inactive. We obtain 


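$$y_i f(x_i) = y_i(\langle w, \Phi(x_i)\rangle + b) = 1 \text{ for } \alpha_i \in (0, C_i) \tag{10.85}$$

and likewise for regression 

$$f(x_i) = \langle w, \Phi(x_i)\rangle + b = y_i - \varepsilon \text{ for } \alpha_i \in (0, C_i) \tag{10.86}$$
$$f(x_i) = \langle w, \Phi(x_i)\rangle + b = y_i + \varepsilon \text{ for } \alpha_i^* \in (0, C_i^*). \tag{10.87}$$


Hence, if all the Lagrange multipliers $\alpha_i$ were optimal, we could easily find $b$ by 
picking any of the unconstrained $\alpha_i$ or $\alpha_i^*$ and solving (10.85), (10.86), or (10.87). 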

Unfortunately, during training, not all Lagrange multipliers will be optimal, 
since, if they were, we would already have obtained the solution. Hence, obtaining 
b by the aforementioned procedure is not accurate. We resort to a technique 
suggested by Keerthi et al. [291, 289] in order to overcome this problem. 

For the sake of simplicity we start with the classification setting; we first split 
the patterns X into the following five sets: 


$$\begin{aligned} I_0 &:= \{i \mid \alpha_i \in (0, C_i)\} \\ I_{+,0} &:= \{i \mid \alpha_i = 0, y_i = +1\} & I_{+,C} &:= \{i \mid \alpha_i = C_i, y_i = +1\} \\ I_{-,0} &:= \{i \mid \alpha_i = 0, y_i = -1\} & I_{-,C} &:= \{i \mid \alpha_i = C_i, y_i = -1\} \end{aligned}$$


Moreover we define 


$$\begin{aligned} e_{hi} &:= \min_{i \in I_0 \cup I_{+,0} \cup I_{-,C}} f(x_i) - y_i \\ e_{lo} &:= \max_{i \in I_0 \cup I_{-,0} \cup I_{+,C}} f(x_i) - y_i \end{aligned} \tag{10.88}$$


Since the KKT conditions have to hold for a solution we can check that this 
corresponds to $e_{hi} \ge 0 \ge e_{lo}$. For $I_0$ we have already exploited this fact in (10.85). 
Formally we can always satisfy the conditions for $e_{hi}$ and $e_{lo}$ by introducing two 
thresholds: $b_{hi} = b - e_{hi}$ and $b_{lo} = b - e_{lo}$. Optimality in this case corresponds to 
$b_{hi} \le b_{lo}$. Additionally, we may use $\frac{1}{2}(b_{hi} + b_{lo})$ as an improved estimate of $b$. 

The real benefit, however, comes from the fact that we may use $e_{hi}$ and $e_{lo}$ to 
choose patterns to focus on. The largest contribution to the discrepancy between 
$e_{hi}$ and $e_{lo}$ stems from that pair of patterns $(i, j)$ for which 


$$\mathrm{discrepancy}(i, j) := (f(x_i) - y_i) - (f(x_j) - y_j) \quad \text{where} \quad \begin{cases} i \in I_0 \cup I_{-,0} \cup I_{+,C} \\ j \in I_0 \cup I_{+,0} \cup I_{-,C} \end{cases} \tag{10.89}$$


is largest. This is a reasonable strategy for the following reason: from Proposition 
10.4 we conclude that the potential change in the variables $\alpha_i, \alpha_j$ is largest if 
the discrepancy $(f(x_i) - y_i) - (f(x_j) - y_j)$ is largest. The only modification is that $i$ 
and $j$ are not chosen arbitrarily any more. 

Finally, we obtain another stopping criterion. Instead of requiring that the violation 
of the KKT condition is smaller than some tolerance Tol we may require 
that $e_{lo} \le e_{hi}$ holds with some tolerance; $e_{lo} \le e_{hi} + 2\,\mathrm{Tol}$. In addition, we will not 
consider patterns where $\mathrm{discrepancy}(i, j) < 2\,\mathrm{Tol}$. See [290] for more details and 
pseudocode of their implementation. 
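
The two thresholds and the most violating pair can be computed in one pass over the data. The following plain-Python sketch is an illustration for the classification case only, not the implementation of [290]; the function names are hypothetical, f[k] denotes the current value f(x_k), and the stopping test uses the tolerance in the form e_lo <= e_hi + 2 Tol. It assumes that both index sets are nonempty, which holds as soon as both classes are present.

```python
def thresholds(alpha, y, f, C):
    """Return (e_hi, e_lo, i, j) with (i, j) the pair of largest discrepancy."""
    hi_set, lo_set = [], []
    for k, (a, label) in enumerate(zip(alpha, y)):
        free = 0.0 < a < C[k]
        if free or (a == 0.0 and label > 0) or (a == C[k] and label < 0):
            hi_set.append(k)      # KKT requires f(x_k) - y_k >= 0 here
        if free or (a == 0.0 and label < 0) or (a == C[k] and label > 0):
            lo_set.append(k)      # KKT requires f(x_k) - y_k <= 0 here
    j = min(hi_set, key=lambda k: f[k] - y[k])
    i = max(lo_set, key=lambda k: f[k] - y[k])
    return f[j] - y[j], f[i] - y[i], i, j

def is_optimal(e_hi, e_lo, tol):
    """Stopping rule: the thresholds may overlap by at most 2*tol."""
    return e_lo <= e_hi + 2.0 * tol
```

The returned pair (i, j) can be passed directly to an analytic two-variable update, and b can be re-estimated from the two thresholds afterwards.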

To adapt these ideas to regression we have to modify the sets I slightly. The 
change is needed since we have to add or subtract $\varepsilon$ in a way that is very similar 
to our treatment of the classification case, where $y_i \in \{\pm 1\}$. 


1. If $\alpha_i = 0$ at optimality we must have $f(x_i) - (y_i - \varepsilon) \ge 0$. 
2. For $\alpha_i \in (0, C_i)$ we must have $f(x_i) - (y_i - \varepsilon) = 0$. 
3. For $\alpha_i = C_i$ we get $f(x_i) - (y_i - \varepsilon) \le 0$. 


Analogous inequalities hold for $\alpha_i^*$. As before we split the patterns X into several 
sets according to 


$$\begin{aligned} I_0 &:= \{i \mid \alpha_i \in (0, C_i)\} & I_{+,0} &:= \{i \mid \alpha_i = 0\} & I_{+,C} &:= \{i \mid \alpha_i = C_i\} \\ I_0^* &:= \{i \mid \alpha_i^* \in (0, C_i^*)\} & I_{-,0} &:= \{i \mid \alpha_i^* = 0\} & I_{-,C} &:= \{i \mid \alpha_i^* = C_i^*\} \end{aligned}$$


and introduce $e_{hi}$, $e_{lo}$ by 


$$e_{hi} := \min\left(\min_{i \in I_0 \cup I_{+,0}} f(x_i) - (y_i - \varepsilon), \; \min_{i \in I_0^* \cup I_{-,C}} f(x_i) - (y_i + \varepsilon)\right) \tag{10.90}$$
$$e_{lo} := \max\left(\max_{i \in I_0 \cup I_{+,C}} f(x_i) - (y_i - \varepsilon), \; \max_{i \in I_0^* \cup I_{-,0}} f(x_i) - (y_i + \varepsilon)\right). \tag{10.91}$$


The equations for computing a more robust estimate of b are identical to the ones 
in the classification case. Note that (10.90) and (10.91) are equivalent to the ansatz 
in [494], the only difference being that we sacrifice a small amount of numerical 
efficiency for a somewhat simpler definition of the sets I (some of them are slightly 
larger than in [494]) and the rules regarding which $e_{hi}$ and $e_{lo}$ are obtained (the 
cases $\alpha_i = 0, \alpha_i^* = C_i^*$ and $\alpha_i^* = 0, \alpha_i = C_i$ are counted twice). 

Without going into further details, we may use a definition of a discrepancy like 
(10.89) and then choose patterns (i, j) for optimization where this discrepancy is 
largest. See the original work [494] for more details. Below we give a simpler (and 
slightly less powerful) reasoning. 


10.5.5 Selection Rules 


The previous section already indicated some ways to pick the indices $(i, j)$ such 
that the decrease in the objective function is maximized. We largely follow the 
reasoning of Platt [409, Section 12.2.2]. See also the pseudocode (Algorithms 10.3 
and 10.4). 

We choose a two loop approach to minimizing the objective function. The outer 
loop iterates over all patterns violating the KKT conditions, or possibly over those 
where the threshold condition of the previous section (using $e_{hi}$ and $e_{lo}$) is violated. 
Usually we first loop only over those with Lagrange multipliers neither on the 
upper nor lower boundary. Once all of these are satisfied we loop over all patterns 
violating the KKT conditions, to ensure self consistency on the complete dataset. 
This solves the problem of choosing the index $i$. 

It is sometimes useful, especially when dealing with noisy data, to iterate over 
the complete KKT violating dataset before complete self consistency on the subset 
has been achieved. Otherwise considerable computational resources are spent 
making subsets self consistent that are not globally self consistent. The trick is 
to perform a full sweep through the data as soon as fewer than, say, 10% of the 
non-bound variables change¹¹. 

Now to select $j$: To make a large step towards the minimum, one looks for large 
steps in $\alpha_i$. Since it is computationally expensive to compute $\chi$ for all possible pairs 
$(i, j)$ one chooses a heuristic to maximize the change in the Lagrange multipliers 
$\alpha_i$ and thus to maximize the absolute value of the numerator in the expressions 
(10.72) and (10.83). This means that we are looking for patterns with large differences 
in their relative errors $f(x_i) - y_i$ and $f(x_j) - y_j$. The index $j$ corresponding to 
the maximum absolute value is chosen for this purpose. 

If this heuristic happens to fail, in other words if little progress is made by 
this choice, all other indices j are looked at (this is what is called “second choice 
hierarchy” in [409]) in the following way. 


1. All indices j corresponding to non-bound examples are looked at, searching for 
an example to make progress on. 


2. In the case that the first heuristic was unsuccessful, all other samples are ana- 
lyzed until an example is found where progress can be made. 


3. If both previous steps fail, SMO proceeds to the next index i. 


For a more detailed discussion and further modifications of these heuristics see 
[409] and [494, 291]. 
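
The first selection rule for j can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `second_choice`, the cached error list `errors` (holding $f(x_k) - y_k$), and the single bound `C` shared by all patterns are hypothetical and not part of the pseudocode in Algorithms 10.3 and 10.4.

```python
def second_choice(i, errors, alphas, C):
    """Return the non-bound index j maximizing |E_i - E_j|, or None if no such j exists.

    errors[k] is assumed to cache f(x_k) - y_k; C is an assumed common upper bound.
    """
    best_j, best_gap = None, -1.0
    for j, (e_j, a_j) in enumerate(zip(errors, alphas)):
        if j == i or not (0.0 < a_j < C):
            continue  # skip i itself and multipliers sitting on a bound
        gap = abs(errors[i] - e_j)
        if gap > best_gap:
            best_j, best_gap = j, gap
    return best_j


errors = [0.5, -1.2, 0.1, 2.0]
alphas = [0.3, 0.7, 0.0, 0.4]
j = second_choice(0, errors, alphas, C=1.0)
```

If the function returns None, or if the subsequent step makes no progress, the fallback loops of the second choice hierarchy take over.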


10.6 Iterative Methods 


Many training algorithms for SVMs or similar estimators can be understood as it- 
erative methods. Their main advantage lies in the simplicity with which they can 
be implemented. While not all of them provide the best performance (plain gradi- 
ent descent in Section 10.6.1) and some may come with restrictions on the scope 
of applications (Lagrangian SVM in Section 10.6.2 can be used only for quadratic 
soft-margin loss), the algorithms presented in this section will allow practition- 
ers to obtain first results in a very short time. Finally, Section 10.6.3 indicates how 
Support Vector algorithms can be extended to online learning problems. 


11. This modification is not contained in the pseudocodes, however, its implementation 
should not pose any further problems. See also [494, 291] for further pseudocodes. 
Algorithm 10.3 Pseudocode for SMO Classification 

function TakeStep(i, j) 
    if i = j then return 0 
    s = y_i y_j 
    if s = 1 then 
        L = max(0, α_i + α_j - C_j) 
        H = min(C_i, α_i + α_j) 
    else 
        L = max(0, α_i - α_j) 
        H = min(C_i, C_j + α_i - α_j) 
    end if 
    if L = H then return 0 
    χ = K_ii + K_jj - 2 K_ij 
    if χ > 0 then 
        ᾱ = α_i + χ^(-1) y_i ((f(x_j) - y_j) - (f(x_i) - y_i)) 
        ᾱ = min(max(ᾱ, L), H) 
    else if y_i ((f(x_j) - y_j) - (f(x_i) - y_i)) > 0 then 
        ᾱ = H 
    else 
        ᾱ = L 
    end if 
    if |α_i - ᾱ| < ε(ε + ᾱ + α_i) then return 0 
    α_j += s(α_i - ᾱ) and α_i = ᾱ   (note: x += y means x = x + y) 
    Update b 
    Update f(x_1), ..., f(x_m) 
    return 1 
end function 

function ExamineExample(i) 
    KKT_i = H(α_i) max(0, y_i f(x_i) - 1) + H(C_i - α_i) max(0, 1 - y_i f(x_i)) 
    if KKT_i > Tol then 
        if number of nonzero and non-bound α_i > 1 then 
            Find j with second choice heuristic 
            if TakeStep(i, j) = 1 then return 1 
        end if 
        for all α_j > 0 and α_j < C_j (start at random point) do 
            if TakeStep(i, j) = 1 then return 1 
        end for 
        for all remaining α_j do 
            if TakeStep(i, j) = 1 then return 1 
        end for 
    end if 
    return 0 
end function 

main SMO Classification(k, X, Y, ε) 
    Initialize α_i = 0 and b = 0, make X, Y, α global variables 
    ExamineAll = 1 
    while NumChanged > 0 or ExamineAll = 1 do 
        NumChanged = 0 
        if ExamineAll = 1 then 
            for all α_i do NumChanged += ExamineExample(i) 
        else 
            for all α_i > 0 and α_i < C_i do NumChanged += ExamineExample(i) 
        end if 
        if ExamineAll = 1 then 
            ExamineAll = 0 
        else if NumChanged = 0 then 
            ExamineAll = 1 
        end if 
    end while 
end main 
Algorithm 10.4 Pseudocode for SMO Regression 

function TakeStep(i, j) 
    if i = j then return 0 
    γ = (α_i - α_i*) + (α_j - α_j*) 
    L = max(γ - C_j, -C_i*) and H = min(γ + C_j*, C_i) 
    if L = H then return 0 
    l = min(γ, 0) and h = max(γ, 0) 
    χ = K_ii + K_jj - 2 K_ij 
    if χ > 0 then 
        β_0 = (α_i - α_i*) + χ^(-1) ((f(x_j) - y_j) - (f(x_i) - y_i)) 
        β_+ = β_0 - 2ε/χ and β_- = β_0 + 2ε/χ 
        β = max(min(β_0, h), l)   (clip β_0 to the interval [l, h]) 
        if β = h then β = max(min(β_+, H), h) 
        if β = l then β = max(min(β_-, l), L) 
    else if (f(x_i) - y_i) - (f(x_j) - y_j) < 0 then 
        β = h 
        if (f(x_i) - y_i) - (f(x_j) - y_j) + 2ε < 0 then β = H 
    else 
        β = l 
        if (f(x_i) - y_i) - (f(x_j) - y_j) - 2ε > 0 then β = L 
    end if 
    if |β - (α_i - α_i*)| < ε(ε + |β| + α_i + α_i*) then return 0 
    α_i = max(β, 0), α_i* = max(-β, 0), α_j = max(0, γ - β), and α_j* = max(0, -γ + β) 
    Update b 
    Update f(x_1), ..., f(x_m) 
    return 1 
end function 

function ExamineExample(i) 
    KKT_i = H(α_i) max(0, f(x_i) - (y_i - ε)) + H(α_i*) max(0, (y_i + ε) - f(x_i)) 
          + H(C_i - α_i) max(0, (y_i - ε) - f(x_i)) + H(C_i* - α_i*) max(0, f(x_i) - (y_i + ε)) 
    if KKT_i > Tol then 
        if number of nonzero and non-bound α_i > 1 then 
            Find j with second choice heuristic 
            if TakeStep(i, j) = 1 then return 1 
        end if 
        for all α_j > 0 and α_j < C_j (start at random point) do 
            if TakeStep(i, j) = 1 then return 1 
        end for 
        for all remaining α_j do 
            if TakeStep(i, j) = 1 then return 1 
        end for 
    end if 
    return 0 
end function 

main SMO Regression(k, X, Y, ε) 
    Initialize α_i, α_i* = 0 and b = 0 
    ExamineAll = 1 
    while NumChanged > 0 or ExamineAll = 1 do 
        NumChanged = 0 
        if ExamineAll = 1 then 
            for all α_i do NumChanged += ExamineExample(i) 
        else 
            for all α_i > 0 and α_i < C_i do NumChanged += ExamineExample(i) 
        end if 
        if ExamineAll = 1 then 
            ExamineAll = 0 
        else if NumChanged = 0 then 
            ExamineAll = 1 
        end if 
    end while 
end main 
10.6.1 Gradient Descent 


Most of the methods in this chapter are concerned with the dual optimization prob- 
lem of the regularized risk functional. It is, however, perfectly legitimate to ask 
whether or not a primal optimization approach would also lead to good solutions. 
The maximum margin perceptron of Kowalczyk [309] for instance follows such an 
approach. Another method which can be understood as gradient descent is Boost- 
ing (see [349, 179]). 

It is important to keep in mind that the choice of parametrization will have 
a significant impact on the performance of the algorithm (see [9] for a discussion 
of these issues in the context of Neural Networks). We could either choose 
to compute the gradient in the function space (thus the Reproducing Kernel 
Hilbert Space $\mathcal{H}$) of f, namely $\partial_f R_{\mathrm{reg}}[f]$, or choose a particular 
parametrization $f(x) = \sum_i \alpha_i k(x_i, x)$ and compute the gradient with respect to 
the parameters $\alpha_i$. Depending on the formulation we obtain different (and variably 
efficient) algorithms. We also briefly mention how the kernel AdaTron [183] fits 
into this context. For convenience of notation we choose the $\lambda$ formulation of the 
regularized risk functional. 

Let us start with gradients in function space. We use the standard RKHS 
regularization terms $\Omega[f] = \frac{1}{2}\|f\|^2_{\mathcal{H}}$ ([349] and, later, [221] use gradients in the 
space $\ell_2^m$ induced by the values of f on the training set). With the definitions of 
(4.1) this yields: 

$$R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} \tag{10.92}$$

$$\partial_f R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} c'(x_i, y_i, f(x_i))\, k(x_i, \cdot) + \lambda f. \tag{10.93}$$

Consequently, we obtain the following update rules for f, given a learning rate $\Lambda$, 

$$f \leftarrow f - \Lambda\,\partial_f R_{\mathrm{reg}}[f] = (1 - \Lambda\lambda) f - \frac{\Lambda}{m}\sum_{i=1}^{m} c'(x_i, y_i, f(x_i))\, k(x_i, \cdot). \tag{10.94}$$

Here the symbol '$\leftarrow$' means 'is updated to'. For computational reasons we have 
to represent f as a linear combination of functions in a finite dimensional space 
(the Representer Theorem of Section 4.2 tells us that m basis functions $k(x_i, x)$ are 
sufficient for this). With the usual expansion $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$ the update rule for 
the coefficients becomes 

$$\alpha \leftarrow (1 - \Lambda\lambda)\alpha - \frac{\Lambda}{m}\gamma = \alpha - \Lambda\left(\lambda\alpha + \frac{1}{m}\gamma\right), \quad\text{where } \gamma_i = c'(x_i, y_i, f(x_i)). \tag{10.95}$$


Distinguishing between the different cases of regression, classification, and classi- 
fication with a Boosting cost function [498, 221] we obtain the derivatives as de- 
scribed in Table 10.2. 
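
As a rough illustration of the coefficient update (10.95), the following Python sketch runs plain gradient descent with the soft-margin derivative from Table 10.2, keeping the factor 1/m from (10.93) explicit. The Gaussian kernel, the toy data, the fixed learning rate, and all function names are assumptions made for this example; no offset b is used.

```python
import math


def rbf(x, z, width=1.0):
    """Gaussian RBF kernel (an assumed choice for this sketch)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * width ** 2))


def soft_margin_derivative(y, fx):
    """c'(x, y, f(x)) for the soft margin loss c = max(0, 1 - y f(x))."""
    return -y if y * fx < 1.0 else 0.0


def gradient_descent(X, Y, lam=0.1, rate=0.5, steps=200, kernel=rbf):
    m = len(X)
    K = [[kernel(xi, xj) for xj in X] for xi in X]
    alpha = [0.0] * m
    for _ in range(steps):
        # current function values f(x_i) = sum_j alpha_j k(x_j, x_i)
        f = [sum(alpha[j] * K[i][j] for j in range(m)) for i in range(m)]
        gamma = [soft_margin_derivative(Y[i], f[i]) for i in range(m)]
        # alpha <- alpha - rate * (lam * alpha + gamma / m)
        alpha = [a - rate * (lam * a + g / m) for a, g in zip(alpha, gamma)]
    return alpha, K


X = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.1, 1.8)]
Y = [-1, -1, 1, 1]
alpha, K = gradient_descent(X, Y)
pred = [1 if sum(alpha[j] * K[i][j] for j in range(len(X))) > 0 else -1
        for i in range(len(X))]
```

A fixed learning rate is used only for brevity; a line search or a decaying rate would be a natural refinement.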

Table 10.2 Cost functions and their derivatives for ε-Regression, Soft-Margin Classification, 
and Boosting with an exponential cost function. 

Setting         | c(x, y, f(x))                      | c'(x, y, f(x)) 
ε-Regression    | f(x) - y - ε   if f(x) - y > ε     |  1   if f(x) - y > ε 
                | y - f(x) - ε   if y - f(x) > ε     | -1   if y - f(x) > ε 
                | 0              otherwise           |  0   otherwise 
Classification  | 1 - y f(x)     if y f(x) < 1       | -y   if y f(x) < 1 
                | 0              otherwise           |  0   otherwise 
Boosting        | exp(-y f(x))                       | -y exp(-y f(x)) 

Note that we can obtain update rules similar to the Kernel AdaTron [183] if we 
modify the loss function c to become 

$$c(x, y, f(x)) = \begin{cases} \frac{1}{2}(1 - y f(x))^2 & \text{if } y f(x) < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{10.96}$$

On a per-pattern basis, this leads to update rules identical to the ones of the 
AdaTron. In particular, if we combine (10.96) with the online extensions of Section 
10.6.3, we fully recover the update rule of the Kernel AdaTron. This means that the 
AdaTron uses squared soft margin loss functions as opposed to the standard soft 
margin loss of SVMs (see footnote 12). 

Rather than a parametrization in function space $\mathcal{H}$ we may also choose to 
start immediately with a parametrization in coefficient space [577, 221]. It is 
straightforward to see that, in the case of RKHS regularization as above (here 
$\|f\|^2_{\mathcal{H}} = \alpha^\top K \alpha$), we obtain, with the definition of $\gamma$ as in (10.95), 

$$\partial_\alpha R_{\mathrm{reg}}[f] = \frac{1}{m} K\gamma + \lambda K\alpha \tag{10.97}$$

$$\alpha \leftarrow \alpha - \Lambda\lambda K\alpha - \frac{\Lambda}{m} K\gamma = \alpha - \Lambda K\left(\lambda\alpha + \frac{1}{m}\gamma\right). \tag{10.98}$$

In other words the updates from (10.95) are multiplied by the kernel matrix K to 
obtain the update rules in the coefficient space. This means that we are performing 
gradient descent in a space with respect to the metric given by K rather than the 
Euclidean metric. The other difference to (10.93) is that it allows us to deal with 
regularization operators other than those based on the RKHS norm of f; 
$\Omega[f] = \sum_i |\alpha_i|$ for example (see Section 4.9.2 and [498]). Table 10.3 gives an 
overview of different regularization operators and their gradients. 


12. The strategy for computing b is different though. Since 
$\partial_b R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} c'(x_i, y_i, f(x_i))$ we may also update b iteratively if desired, 
whereas in the AdaTron we must add a constant offset to the kernel function in order 
to obtain an update rule. 




Table 10.3 Gradients of the regularization term. 

Ω[f]                 | Description                  | Gradient with respect to α 
(1/2) ||f||_H^2      | standard SV regularizer      | Kα 
(1/2) ||f||_H        | renormalized SV regularizer  | (1/2) (α^T K α)^(-1/2) Kα 
Σ_i |α_i|            | sparsity regularizer         | (sgn(α_1), ..., sgn(α_m)) 
(1/2) Σ_i |f(x_i)|^2 | ℓ_2 norm on data             | K^T K α 


Since a unit step in the direction of the negative gradient of $R_{\mathrm{reg}}[f]$ does not 
necessarily guarantee that $R_{\mathrm{reg}}[f]$ will decrease, it is advantageous in many cases 
to perform a line search in the direction of $\partial_\alpha R_{\mathrm{reg}}[f]$, specifically, to seek $\gamma$ such 
that $R_{\mathrm{reg}}[f - \gamma\,\partial_\alpha R_{\mathrm{reg}}[f]]$ is minimized. Details of how this can be achieved are 
in Section 6.2.1 and Algorithm 6.5. Moreover, Section 10.6.3 describes how the 
gradient descent approach may be adapted to online learning, that is, stochastic 
gradient descent. 

We conclude the discussion of gradient descent algorithms by stating a lower 
bound on the minimum value of the regularized risk functional [221], which 
depends on the size of the gradient in function space. 


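Theorem 10.5 (Lower Bound on Primal Objective Function) Denote by $R_{\mathrm{emp}}[f]$ a 
convex and differentiable functional on a Hilbert space $\mathcal{H}$ and consider the regularized risk 
functional 

$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} \quad\text{where } \lambda > 0. \tag{10.99}$$

Then, for any $f, \Delta f \in \mathcal{H}$, 

$$R_{\mathrm{reg}}[f] - R_{\mathrm{reg}}[f - \Delta f] \le \frac{1}{2\lambda}\|\nabla R_{\mathrm{reg}}[f]\|^2. \tag{10.100}$$

Proof We assume that $f - \Delta f$ is the minimizer of $R_{\mathrm{reg}}$, since proving the 
inequality for the minimizer is sufficient. Since $R_{\mathrm{emp}}$ is convex and differentiable 
we know that 

$$R_{\mathrm{emp}}[f] - R_{\mathrm{emp}}[f - \Delta f] \le \langle \Delta f, \nabla R_{\mathrm{emp}}[f] \rangle. \tag{10.101}$$

Therefore we may bound $\rho(f, \Delta f) := R_{\mathrm{reg}}[f] - R_{\mathrm{reg}}[f - \Delta f]$ by 

$$\rho(f, \Delta f) \le \langle \Delta f, \nabla R_{\mathrm{emp}}[f] \rangle + \lambda\,\omega(\|f\|) - \lambda\,\omega(\|f - \Delta f\|). \tag{10.102}$$

It is easy to check that for $\omega(\|f\|) = \frac{1}{2}\|f\|^2$ the right-hand side of (10.102) equals 
$\langle \Delta f, \nabla R_{\mathrm{emp}}[f] + \lambda f \rangle - \frac{\lambda}{2}\|\Delta f\|^2$, which is maximized by 

$$\lambda\,\Delta f = \nabla R_{\mathrm{emp}}[f] + \lambda f = \nabla R_{\mathrm{reg}}[f]. \tag{10.103}$$

Substituting this back into the bound on $\rho(f, \Delta f)$ proves (10.100). ∎ 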


Eq. (10.100) shows that a stopping criterion based on the size of the gradient is a 
feasible strategy when minimizing regularized risk functionals. 
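
The bound can be checked numerically on a deliberately simple case. The sketch below assumes the one-dimensional Hilbert space of real numbers with $R_{\mathrm{emp}}[f] = \frac{1}{2}(f - y)^2$, so that the minimizer of the regularized risk is $y/(1 + \lambda)$; the function names and the chosen values are illustrative only.

```python
def r_reg(f, y=1.0, lam=0.5):
    """Regularized risk on H = R with R_emp[f] = (f - y)^2 / 2."""
    return 0.5 * (f - y) ** 2 + 0.5 * lam * f ** 2


def grad_r_reg(f, y=1.0, lam=0.5):
    """Gradient of r_reg with respect to f."""
    return (f - y) + lam * f


def gradient_bound(f, y=1.0, lam=0.5):
    """Right-hand side of (10.100): squared gradient norm divided by 2 * lam."""
    return grad_r_reg(f, y, lam) ** 2 / (2.0 * lam)


f_star = 1.0 / (1.0 + 0.5)  # minimizer y / (1 + lam) for y = 1, lam = 0.5
gaps = [(r_reg(f) - r_reg(f_star), gradient_bound(f)) for f in (-2.0, 0.0, 0.5, 3.0)]
```

In each pair the first entry, the distance to the optimal objective value, does not exceed the second, so iterating until the gradient bound falls below a tolerance guarantees the same tolerance on the objective.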
Gradient descent algorithms are relatively simple to implement but we should 
keep in mind that they often do not enjoy the convergence guarantees of more 
sophisticated algorithms. They are useful tools for a first implementation if no 
other optimization code is available, however. 


10.6.2 Lagrangian Support Vector Machines 


Mangasarian and Musicant [348] present a particularly fast and simple algorithm 
which deals with classification problems involving squared slacks (this is the same 
problem that the AdaTron algorithm also attempts to minimize). Below we show a 
version thereof extended to the nonlinear case. We begin with the basic definitions 

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y f(x) \ge 1 \\ (1 - y f(x))^2 & \text{otherwise.} \end{cases} \tag{10.104}$$

The second modification needed for the algorithm is that we also regularize the 
constant offset b in the function expansion, i.e. $\Omega[f] = \|w\|^2 + b^2$ where 
$f(x) = \langle w, \Phi(x) \rangle + b$. This reduces the number of constraints in the optimization 
problem at the expense of losing translation invariance in feature space. It is still an 
open question whether this modification is detrimental to generalization performance. 
In short, we have the following optimization problem: 

$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2m}\sum_{i=1}^{m} \xi_i^2 + \frac{\lambda}{2}\left(\|w\|^2 + b^2\right) \\ \text{subject to} \quad & y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i \text{ where } \xi_i \ge 0. \end{aligned} \tag{10.105}$$

By using the tools from Chapter 6 (see also [345, 348]) one can show that the dual 
optimization problem of (10.105) is given by 

$$\begin{aligned} \underset{\alpha}{\text{minimize}} \quad & \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left(K_{ij} + 1 + \lambda m\,\delta_{ij}\right) - \sum_{i=1}^{m} \alpha_i \\ \text{subject to} \quad & \alpha_i \ge 0 \text{ for all } i \in [m] \end{aligned} \tag{10.106}$$

where $w = \sum_{i=1}^{m} y_i \alpha_i \Phi(x_i)$, $b = \sum_{i=1}^{m} \alpha_i y_i$, and $\xi_i = \lambda m\,\alpha_i$. 

In the following we develop a recursion relation to determine a solution of 
(10.105). For convenience we use a slightly more compact representation of the 
quadratic matrix in (10.106). We define 

$$Q := \mathrm{diag}(y)\left(K + \lambda m\,\mathbf{1} + \vec{1}\,\vec{1}^\top\right)\mathrm{diag}(y) \tag{10.107}$$

where diag(y) denotes the matrix with diagonal entries $y_i$, $\mathbf{1}$ is the unit matrix, 
and $\vec{1}$ is the vector of all ones. Since the $\alpha_i$ are Lagrange multipliers, it follows from 
the KKT conditions (see Theorem 6.21) that only if the constraints 
$y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i$ of (10.105) are active may the Lagrange multipliers $\alpha_i$ 
be nonzero. With the definition of Q we can write these conditions as $\alpha_i > 0$ only if 
$(Q\alpha)_i = 1$. Summing over all indices i we have 

$$\alpha^\top (Q\alpha - \vec{1}) = 0. \tag{10.108}$$
Now, if we can find some $\alpha$ which are both feasible (they satisfy the constraints 
imposed on them) and which also satisfy $\alpha^\top (Q\alpha - \vec{1}) = 0$, then we have found a 
solution. The key optimization algorithm trick lies in the following lemma [348]. 


Lemma 10.6 (Orthogonality and Clipping) Denote by $a, b \in \mathbb{R}^m$ two arbitrary 
vectors. Then the following two conditions are equivalent: 

$$\left\{ a, b \ge 0 \text{ and } a^\top b = 0 \right\} \iff \left\{ a = (a - \gamma b)_+ \text{ for all } \gamma > 0 \right\}. \tag{10.109}$$

See Problem 10.15 for a proof. 
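
A quick numerical illustration of the lemma, written as a small Python sketch. It only tests the clipping condition for a few values of $\gamma$ rather than for all of them, and the helper names are made up for this example.

```python
def plus(v):
    """Componentwise clipping (v)_+ = max(v, 0)."""
    return [max(0.0, t) for t in v]


def satisfies_fixed_point(a, b, gammas=(0.1, 1.0, 10.0), tol=1e-12):
    """Check a = (a - gamma * b)_+ for a few sample values of gamma."""
    for g in gammas:
        clipped = plus([x - g * y for x, y in zip(a, b)])
        if any(abs(ai - ci) > tol for ai, ci in zip(a, clipped)):
            return False
    return True


# a, b >= 0 with a^T b = 0: the clipping condition holds
ok = satisfies_fixed_point([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
# a^T b = 1 != 0: the clipping condition fails
bad = satisfies_fixed_point([2.0, 1.0], [0.5, 0.0])
```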


Consequently it is a condition on $\alpha$ that, for all $\gamma > 0$, 

$$Q\alpha - \vec{1} = \left((Q\alpha - \vec{1}) - \gamma\alpha\right)_+ \tag{10.110}$$

must hold. As previously mentioned [348], a solution $\alpha$ satisfying (10.108) is 
the minimizer of the constrained optimization problem (10.106). Furthermore, 
Lemma 10.6 implies that (10.108) is equivalent to (10.110) for all $\gamma > 0$. This 
suggests an iteration scheme for obtaining $\alpha$ whereby 

$$\alpha^{i+1} = Q^{-1}\left(\left((Q\alpha^{i} - \vec{1}) - \gamma\alpha^{i}\right)_+ + \vec{1}\right). \tag{10.111}$$

The theorem below shows that (10.111) is indeed a convergent algorithm and that 
it converges linearly. 


Theorem 10.7 (Global Convergence of Lagrangian SVM [348]) For any symmetric 
positive matrix K and Q given by (10.107), under the condition that $0 < \gamma < 2\lambda m$, the 
iteration scheme (10.111) will converge at a linear rate to the solution $\bar{\alpha}$ and 

$$\|Q\alpha^{i+1} - Q\bar{\alpha}\| \le \|\mathbf{1} - \gamma Q^{-1}\| \cdot \|Q\alpha^{i} - Q\bar{\alpha}\|. \tag{10.112}$$


Proof By construction $\bar{\alpha}$ is a fixed point of (10.111). Since clipping by $(\cdot)_+$ 
cannot increase distances, we have 

$$\begin{aligned} \|Q\alpha^{i+1} - Q\bar{\alpha}\| &= \left\|(Q\alpha^{i} - \vec{1} - \gamma\alpha^{i})_+ - (Q\bar{\alpha} - \vec{1} - \gamma\bar{\alpha})_+\right\| && (10.113) \\ &\le \left\|(Q - \gamma\mathbf{1})(\alpha^{i} - \bar{\alpha})\right\| && (10.114) \\ &\le \|\mathbf{1} - \gamma Q^{-1}\| \cdot \|Q\alpha^{i} - Q\bar{\alpha}\|. && (10.115) \end{aligned}$$

Next we bound the norm $\|\mathbf{1} - \gamma Q^{-1}\|$ and, in particular, we show under which 
conditions it is less than 1. By construction we know that the smallest eigenvalue 
of Q is at least $\lambda m$ and, moreover, Q is a positive matrix. Hence $Q^{-1}$ is also positive 
and its largest eigenvalue is bounded from above by $\frac{1}{\lambda m}$. Therefore the eigenvalues 
of $\mathbf{1} - \gamma Q^{-1}$ lie between $1 - \frac{\gamma}{\lambda m}$ and 1, so that $\|\mathbf{1} - \gamma Q^{-1}\| < 1$ and, consequently, 
the algorithm will converge for all $0 < \gamma < 2\lambda m$. ∎ 
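
The iteration (10.111) is short enough to try out directly. The following Python sketch builds Q as in (10.107) for a tiny, made-up dataset with a linear kernel and solves the linear system in each step by Gaussian elimination instead of forming $Q^{-1}$. The helper `solve`, the data, the step size $\gamma = 1.9\,\lambda m$, and the iteration count are assumptions chosen for illustration, not part of the algorithm as published.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def lagrangian_svm(K, y, lam_m, gamma=None, iters=2000):
    """Iterate (10.111); lam_m stands for the product lambda * m."""
    m = len(y)
    gamma = 1.9 * lam_m if gamma is None else gamma  # needs 0 < gamma < 2 * lam_m
    Q = [[y[i] * y[j] * (K[i][j] + 1.0 + (lam_m if i == j else 0.0))
          for j in range(m)] for i in range(m)]
    alpha = solve(Q, [1.0] * m)  # unconstrained starting point
    for _ in range(iters):
        Qa = [sum(Q[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        rhs = [max(0.0, Qa[i] - 1.0 - gamma * alpha[i]) + 1.0 for i in range(m)]
        alpha = solve(Q, rhs)
    return alpha, Q


X = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (4.0, 3.0)]
y = [-1, -1, 1, 1]
K = [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]  # linear kernel
alpha, Q = lagrangian_svm(K, y, lam_m=1.0)
slack = [sum(Q[i][j] * alpha[j] for j in range(4)) - 1.0 for i in range(4)]
```

At convergence `alpha` and `slack` are both nonnegative and orthogonal, which is the condition (10.108). For large m one would not solve a dense system in every step; that is what the low rank approach below addresses.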
To make practical use of (10.111) on large amounts of data we need to find a way 
to invert Q cheaply. Recall Section 10.3.4 where we dealt with a similar problem 
in the context of interior point optimization codes. Assuming that we can find 
a low rank approximation of K, by $\tilde{K} = K^{mn}(K^{nn})^{-1}(K^{mn})^\top$ for example, we may 
replace K by $\tilde{K}$ throughout the algorithm, apply the Sherman-Woodbury-Morrison 
formula (10.54) and invert Q approximately. 

The additional benefit is that we get a compact representation of the solution 
of the classification problem in a small number of basis functions, n. Thus the 
evaluation of the solution is much faster than if the full matrix K had been used. 
The approximation in this setting ignores the smallest eigenvalues of K, which 
will be dominated by the addition of the regularization term $\lambda m\,\mathbf{1}$ in the definition 
(10.107) of Q anyway. In analogy to (10.55) we obtain 

$$\begin{aligned} & \left(\tilde{K} + \lambda m\,\mathbf{1} + \vec{1}\,\vec{1}^\top\right)^{-1} && (10.116) \\ &= \left(K^{mn}(K^{nn})^{-1}(K^{mn})^\top + \lambda m\,\mathbf{1} + \vec{1}\,\vec{1}^\top\right)^{-1} && (10.117) \\ &= (\lambda m)^{-1}\mathbf{1} - (\lambda m)^{-2}\left[K^{mn}\ \vec{1}\right] Q_{\mathrm{red}}^{-1} \left[K^{mn}\ \vec{1}\right]^\top && (10.118) \end{aligned}$$

where 

$$Q_{\mathrm{red}} = \begin{bmatrix} K^{nn} & 0 \\ 0 & 1 \end{bmatrix} + \frac{1}{\lambda m}\left[K^{mn}\ \vec{1}\right]^\top \left[K^{mn}\ \vec{1}\right]. \tag{10.119}$$

Likewise, the matrix multiplications by Q can be sped up by the low rank 
decomposition of K. Overall, the cost of one update step is $O(n^2 m)$; significantly less than 
$O(m^3)$, which would be incurred if we had to invert Q exactly. The same methods 
that can be used to implement any interior point method (out-of-core storage of 
the matrix $K^{mn}$ for example) can also be applied to Lagrangian SVM. 

For the special case that we have only linear kernels $k(x, x') = \langle x, x' \rangle$, the update 
rule becomes particularly simple. Here we can represent K as $K = X^\top X$ where X 
denotes the matrix of all patterns $x_i$. The MATLAB code (courtesy of Mangasarian 
and Musicant) is given in Algorithm 10.5 (in the nonlinear case, we can adapt the 
algorithm easily by replacing X with $K^{mn}(K^{nn})^{-1/2}$, where $K^{mn}$ and $K^{nn}$ are defined 
as in Section 10.2.1). 


10.6.3 Online Extensions 


Online learning differs from the settings in the other chapters of this book, which 
study batch learning, insofar as it assumes that we have a (possibly infinite) stream 
of incoming data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. The goal is to predict $y_i$ and incur as little loss as 
possible during the iterations. This goal is quite different from that of minimizing 
the expected risk since the distribution from which the data $(x_i, y_i)$ is drawn may 
change over time and thus no single estimate $f : \mathcal{X} \to \mathcal{Y}$ may be optimal over the 
total time (see [32, 54, 378]). 

Algorithm 10.5 Linear Lagrangian Support Vector Machines 

function [it, opt, w, b] = lsvm(X,Y,lambdam,itmax,tol) 
% X: m x n matrix of patterns (one per row), Y: m x m diagonal matrix of labels 
[m,n]=size(X); 
gamma=1.9*lambdam;                   % step size, must satisfy 0 < gamma < 2*lambdam 
e=ones(m,1); 
H=Y*[X -e]; 
it=0; 
S=H*inv((speye(n+1)*lambdam+H'*H));  % Sherman-Morrison-Woodbury factor 
alpha=(1-S*(H'*e))/lambdam;          % alpha = inv(Q)*e 
oldalpha=alpha+1; 
while it<itmax & norm(oldalpha-alpha)>tol 
  z=(1+pl(((lambdam*alpha+H*(H'*alpha))-gamma*alpha)-1)); 
  oldalpha=alpha; 
  alpha=(z-S*(H'*z))/lambdam;        % alpha = inv(Q)*z 
  it=it+1; 
end; 
opt=norm(alpha-oldalpha); w=X'*Y*alpha; b=-e'*Y*alpha; 
% with H = Y*[X -e] the decision function is f(x) = x'*w - b 

function pl = pl(x); pl = (abs(x)+x)/2;   % plus function (x)_+ 


At every step t we could attempt to perform an optimal prediction based on the 
minimizer of the regularized risk functional $R_{\mathrm{reg}}[f]$ where our training set consists 
of $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$. Unfortunately this task is completely computationally 
infeasible since it would require that we solve an ever-increasing optimization 
problem in t variables at every instance. Hence the time required to perform the 
prediction would increase polynomially over time due to the increasing sample 
size. This is clearly not desirable. 

Another problem arises from the Representer Theorem (Th. 4.2). It states that 
the solution is a linear combination of kernel functions k(x;,-) centered at the 
training points. Assuming that the probability of whether a point will become 
a kernel function does not depend on t this shows that, at best, the selection of 
basis functions will change while, typically, the number of basis functions selected 
will grow without bound (see Problem 10.18). This means that prediction will 
also become increasingly expensive and the computational effort is likely to grow 
polynomially. 

From these two problems we conclude that if we want to use an online setting 
we should perform some sort of approximation rather than trying to solve the 
learning problem exactly. 

One possibility is to project every new basis function $k(x_t, \cdot)$ onto a set of 
existing basis functions, say $k(x_{n_1}, \cdot), \ldots, k(x_{n_N}, \cdot)$, and find a solution in the so-chosen 
subspace. This is very similar to online learning with a neural network with fixed 
architecture. We thus perform learning with respect to the functional 

$$R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} c\Big(x_i, y_i, \sum_{j=1}^{N} \alpha_j k(x_{n_j}, x_i)\Big) + \frac{\lambda}{2}\sum_{j,l=1}^{N} k(x_{n_j}, x_{n_l})\,\alpha_j \alpha_l, \tag{10.120}$$

where $f = \sum_{j=1}^{N} \alpha_j k(x_{n_j}, \cdot)$. Unfortunately the computational cost is at least $O(N^2)$ 
per iteration since computing the gradient of (10.120) with respect to $\alpha$ already 
requires a matrix-vector multiplication, no matter how simple we manage to keep 
the sample dependent term $\frac{1}{m}\sum_{i=1}^{m} c(x_i, y_i, f(x_i))$ (see footnote 13). This shows that 
any gradient descent algorithm in a lower dimensional fixed space will exhibit this 
problem. Hence, projection algorithms do not appear to be a promising strategy. 

Likewise, incremental update algorithms [93] claim to overcome this problem 
but cannot guarantee a bound on the number of operations required per iteration. 
Hence, we must resort to different methods. 

Recently proposed algorithms [194, 242, 214, 329] perform perceptron-like up- 
dates for classification at each step. Some algorithms work only in the noise free 
case, others do not work for moving targets, and still others assume an upper 
bound on the complexity of the estimators. Below we present a simple method 
which allows the use of kernel estimators for classification, regression, and nov- 
elty detection and which copes with a large number of kernel functions efficiently. 


Stochastic Approximation The following method [299] addresses the problem by 
formulating it in the Reproducing Kernel Hilbert Space $\mathcal{H}$ directly and then by 
carrying out approximations during the update process. We will minimize the 
ordinary regularized risk functional (4.2); $R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}}$. Since we 
want to perform stochastic gradient descent, the empirical error term $R_{\mathrm{emp}}[f]$ is 
replaced by the empirical error estimate at instance $(x_t, y_t)$, namely $c(x_t, y_t, f(x_t))$. 
This means that at time t we have to compute the gradient of 

$$R_{\mathrm{stoch}}[f, t] := c(x_t, y_t, f(x_t)) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} \tag{10.121}$$

and then perform gradient descent with respect to $R_{\mathrm{stoch}}[f, t]$. Here t is either 
randomly chosen from $\{1, \ldots, m\}$ or it is the new training instance observed at 
time t. Consequently the gradient of $R_{\mathrm{stoch}}[f, t]$ with respect to f is 

$$\partial_f R_{\mathrm{stoch}}[f, t] = c'(x_t, y_t, f(x_t))\, k(x_t, \cdot) + \lambda f. \tag{10.122}$$

The update equations are thus straightforward, 

$$f \leftarrow f - \Lambda\,\partial_f R_{\mathrm{stoch}}[f, t], \tag{10.123}$$

where $\Lambda \in \mathbb{R}^+$ is the learning rate controlling the size of updates undertaken at 
each iteration. We will return to the issue of adjusting $(\lambda, \Lambda)$ at a later stage. 


Descent Algorithm Substituting the definition of $R_{\mathrm{stoch}}[f, t]$ into (10.123) we obtain 

$$\begin{aligned} f &\leftarrow f - \Lambda\left(c'(x_t, y_t, f(x_t))\, k(x_t, \cdot) + \lambda f\right) && (10.124) \\ &= (1 - \lambda\Lambda) f - \Lambda\, c'(x_t, y_t, f(x_t))\, k(x_t, \cdot). && (10.125) \end{aligned}$$

While (10.124) is convenient in a theoretical analysis, it is not directly amenable to 
computation. For this purpose we have to express f as a kernel expansion 

$$f(x) = \sum_i \alpha_i k(x_i, x) \tag{10.126}$$

where the $x_i$ are (previously seen) training patterns. Then (10.125) becomes 

$$\begin{aligned} \alpha_t &\leftarrow (1 - \lambda\Lambda)\alpha_t - \Lambda\, c'(x_t, y_t, f(x_t)) && (10.127) \\ &= -\Lambda\, c'(x_t, y_t, f(x_t)) \quad\text{for } \alpha_t = 0 && (10.128) \\ \alpha_i &\leftarrow (1 - \lambda\Lambda)\alpha_i \quad\text{for } i \ne t. && (10.129) \end{aligned}$$


13. If we decide to use the gradient in function space instead then the gradient itself will be 
cheap to compute, but projection of the gradient onto the N dimensional subspace will cost 
us $O(N^2)$ operations. 



Eq. (10.127) means that, at each iteration, the kernel expansion may grow by one 
term. Further, the cost of training at each step is not larger than the prediction cost. 
Once we have computed $f(x_t)$, $\alpha_t$ is obtained by the value of the derivative of c at 
$(x_t, y_t, f(x_t))$. 

Instead of updating all coefficients $\alpha_i$ we may simply cache the power series 
$1, (1 - \lambda\Lambda), (1 - \lambda\Lambda)^2, (1 - \lambda\Lambda)^3, \ldots$ and pick suitable terms as needed. This is 
particularly useful if the derivatives of the loss function c only assume discrete 
values, say $\{-1, 0, 1\}$ as is the case when using the soft-margin type loss functions. 
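
The updates (10.127) to (10.129) translate almost line by line into code. The Python sketch below uses the soft margin loss and no offset b; the class name, the Gaussian kernel, the parameter values, and the toy stream are all assumptions made for illustration. For simplicity it rescales every coefficient explicitly rather than caching the power series.

```python
import math


def rbf(x, z, width=1.0):
    """Gaussian RBF kernel (an assumed choice for this sketch)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * width ** 2))


class OnlineKernelClassifier:
    """Stochastic gradient descent on R_stoch with the soft margin loss."""

    def __init__(self, kernel, lam=0.01, rate=0.1):
        self.kernel, self.lam, self.rate = kernel, lam, rate
        self.points, self.coeffs = [], []

    def predict(self, x):
        return sum(a * self.kernel(p, x) for a, p in zip(self.coeffs, self.points))

    def update(self, x, y):
        fx = self.predict(x)
        deriv = -y if y * fx < 1.0 else 0.0               # c'(x_t, y_t, f(x_t))
        shrink = 1.0 - self.lam * self.rate
        self.coeffs = [shrink * a for a in self.coeffs]   # (10.129)
        if deriv != 0.0:
            self.points.append(x)
            self.coeffs.append(-self.rate * deriv)        # (10.128)
        return fx


data = [((0.0, 0.0), -1), ((2.0, 2.0), 1), ((0.2, 0.1), -1), ((2.1, 1.8), 1)]
clf = OnlineKernelClassifier(rbf)
for _ in range(50):
    for x, y in data:
        clf.update(x, y)
margins = [y * clf.predict(x) for x, y in data]
```

Only patterns with a nonzero loss derivative add a term, so the expansion grows more slowly than the stream itself.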


Truncation The problem with (10.127) and (10.129) is that, without any further 
measures, the number of basis functions n will grow without bound. This is not 
desirable since n determines the amount of computation needed for prediction. 
The regularization term helps us here. At each iteration the coefficients $\alpha_i$ with 
$i \ne t$ are shrunk by $(1 - \lambda\Lambda)$. Thus, after $\tau$ iterations, the coefficient $\alpha_i$ will be 
reduced to $(1 - \lambda\Lambda)^{\tau}\alpha_i$. 


Proposition 10.8 (Truncation Error) For a loss function $c(x, y, f(x))$ with its first 
derivative bounded by C and a kernel k with bounded norm $\|k(x, \cdot)\| \le X$, the truncation 
error in f incurred by dropping terms $\alpha_i$ from the kernel expansion of f after $\tau$ 
update steps is bounded by $\Lambda(1 - \lambda\Lambda)^{\tau} C X$. In addition, the total truncation error due to 
dropping all terms which are at least $\tau$ steps old is bounded by 

$$\|f - f_{\mathrm{trunc}}\|_{\mathcal{H}} \le \sum_{i=1}^{t-\tau} \Lambda(1 - \lambda\Lambda)^{t-i}\, C X < \lambda^{-1}(1 - \lambda\Lambda)^{\tau}\, C X. \tag{10.130}$$

Here $f_{\mathrm{trunc}} = \sum_{i=t-\tau+1}^{t} \alpha_i k(x_i, \cdot)$. Obviously the approximation quality increases 
exponentially with the number of terms retained. 
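
To get a feeling for the numbers, the right-hand side of (10.130) can be inverted to find how many terms need to be kept for a given tolerance. The Python sketch below does this; the function names and the example values ($\lambda = 0.1$, $\Lambda = 0.5$, $C = X = 1$) are assumptions chosen for illustration, and the formula requires $0 < \lambda\Lambda < 1$.

```python
import math


def truncation_bound(tau, lam, rate, C=1.0, X=1.0):
    """Right-hand side of (10.130): (1 - lam * rate)^tau * C * X / lam."""
    return (1.0 - lam * rate) ** tau * C * X / lam


def horizon_for(tol, lam, rate, C=1.0, X=1.0):
    """Smallest tau whose truncation bound is at most tol (needs 0 < lam * rate < 1)."""
    tau = math.ceil(math.log(tol * lam / (C * X)) / math.log(1.0 - lam * rate))
    return max(tau, 0)


tau = horizon_for(0.01, lam=0.1, rate=0.5)
```

Because the bound decays geometrically, tightening the tolerance by a constant factor only adds a constant number of terms.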

The regularization parameter A can thus be used to control the storage require- 
ments for the expansion. Moreover, it naturally allows for distributions P(x, y) 
that change over time in which case it is desirable to forget instances (x;, yi) that 
are much older than the average time scale of the distribution change [298]. 


We now proceed to applications of (10.127) and (10.129) in specific learning 
situations. We utilize the standard addition of the constant offset b to the function 
expansion, $g(x) = f(x) + b$ where $f \in \mathcal{H}$ and $b \in \mathbb{R}$. Hence we also update b into 
$b - \Lambda\,\partial_b R_{\mathrm{stoch}}[g]$. 


Classification We begin with the soft margin loss (3.3), given by 
$c(x, y, g(x)) = \max(0, 1 - y g(x))$. In this situation the update equations become 

$$(\alpha_i, \alpha_t, b) \leftarrow \begin{cases} ((1 - \lambda\Lambda)\alpha_i,\ \Lambda y_t,\ b + \Lambda y_t) & \text{if } y_t g(x_t) < 1 \\ ((1 - \lambda\Lambda)\alpha_i,\ 0,\ b) & \text{otherwise.} \end{cases} \tag{10.131}$$

For classification with the $\nu$-trick, as defined in (7.40), we also have to take care of 
the margin $\rho$, since there $c(x, y, g(x)) = \max(0, \rho - y g(x)) - \nu\rho$. On the other hand, 
one can show [481] (see also Problem 7.16) that the specific choice of $\lambda$ has no 
influence on the estimate in $\nu$-SV classification. Therefore, we may set $\lambda = 1$. Taking a 
gradient step in $\rho$ as well, we obtain 

$$(\alpha_i, \alpha_t, b, \rho) \leftarrow \begin{cases} ((1 - \Lambda)\alpha_i,\ \Lambda y_t,\ b + \Lambda y_t,\ \rho - \Lambda(1 - \nu)) & \text{if } y_t g(x_t) < \rho \\ ((1 - \Lambda)\alpha_i,\ 0,\ b,\ \rho + \Lambda\nu) & \text{otherwise.} \end{cases} \tag{10.132}$$

By analogy to Propositions 8.3 and 7.5, only a fraction of points will be used for 
updates. Finally, if we choose the hinge-loss, $c(x, y, g(x)) = \max(0, -y g(x))$, 

$$(\alpha_i, \alpha_t, b) \leftarrow \begin{cases} ((1 - \lambda\Lambda)\alpha_i,\ \Lambda y_t,\ b + \Lambda y_t) & \text{if } y_t g(x_t) < 0 \\ ((1 - \lambda\Lambda)\alpha_i,\ 0,\ b) & \text{otherwise.} \end{cases} \tag{10.133}$$

Setting $\lambda = 0$ recovers the kernel-perceptron algorithm. For nonzero $\lambda$ we obtain 
the kernel-perceptron with regularization. 


Novelty Detection Results for novelty detection (see Chapter 8 and [475]) are 
similar in spirit. The $\nu$-setting is most useful here, particularly where the estimator 
acts as a warning device (network intrusion detection for example) or when we 
would like to specify an upper limit on the frequency of alerts $f(x) < \rho$. The 
relevant loss function, as introduced in (8.6), is $c(x, y, f(x)) = \max(0, \rho - f(x)) - \nu\rho$ 
and usually [475] one uses $f \in \mathcal{H}$ rather than $f + b$, where $b \in \mathbb{R}$, in order to avoid 
trivial solutions. The update equations are 

$$(\alpha_i, \alpha_t, \rho) \leftarrow \begin{cases} ((1 - \Lambda)\alpha_i,\ \Lambda,\ \rho - \Lambda(1 - \nu)) & \text{if } f(x_t) < \rho \\ ((1 - \Lambda)\alpha_i,\ 0,\ \rho + \Lambda\nu) & \text{otherwise.} \end{cases} \tag{10.134}$$

Considering the update of $\rho$ we can see that, on average, only a fraction of $\nu$ 
observations will be considered for updates. Thus we only have to store a small 
fraction of the $x_i$. We can see that the learning rate $\Lambda$ provides us with a handle 
to trade off the cost of the expansion (in terms of the number of basis functions 
needed) with the time horizon of the prediction; the smaller $\Lambda$, the more patterns 
are included since the coefficients $\alpha_i$ will decay only very slowly. Beyond this 
point, further research needs to be done to show how $\Lambda$ is best adjusted (a rule 
of thumb is to let it decay as $1/\sqrt{m}$). Figure 10.7 contains initial results of the online 
novelty detection algorithm. 

Algorithm 10.6 describes the learning procedure for novelty detection in detail. 
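
For readers who prefer code to pseudocode, here is a small Python sketch of the same procedure. It follows the steps of Algorithm 10.6 (decay the coefficients, add a term when $f(x_t) < \rho$, move $\rho$ by $\Lambda(\nu - H(\rho - f(x_t)))$, truncate). The function name, the Gaussian kernel, the starting value $\rho = 0$, and the synthetic stream are assumptions made for this illustration.

```python
import math


def rbf(x, z, width=1.0):
    """Gaussian RBF kernel (an assumed choice for this sketch)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * width ** 2))


def online_novelty_detection(stream, kernel, rate=0.05, nu=0.2, horizon=50):
    points, coeffs, rho = [], [], 0.0  # rho = 0 at the start is an assumption
    alerts = []
    for x in stream:
        fx = sum(a * kernel(p, x) for a, p in zip(coeffs, points))
        novel = fx < rho                             # H(rho - f(x_t)) = 1
        coeffs = [(1.0 - rate) * a for a in coeffs]  # decay the old coefficients
        if novel:
            points.append(x)
            coeffs.append(rate)                      # new term with weight rate
        rho += rate * (nu - (1.0 if novel else 0.0))
        points, coeffs = points[-horizon:], coeffs[-horizon:]  # keep newest terms
        alerts.append(novel)
    return alerts, rho, points, coeffs


stream = [(math.sin(0.37 * t), math.cos(0.91 * t)) for t in range(300)]
alerts, rho, points, coeffs = online_novelty_detection(stream, rbf)
```

The list `alerts` records which patterns were flagged; the stored expansion never exceeds `horizon` terms.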


Regression We consider the following four settings: squared loss, the ε-insensitive 
loss using the $\nu$-trick, Huber's robust loss function, and trimmed mean estimators. 
For convenience we only use estimates $f \in \mathcal{H}$ rather than $g = f + b$ where $b \in \mathbb{R}$. 
The extension to the latter case is straightforward. 

Figure 10.7 Online novelty detection on the USPS dataset (dimension N = 256). We use 
Gaussian RBF kernels with width $2\sigma^2 = 0.5N = 128$. The learning rate was adjusted to $1/\sqrt{m}$ 
where m is the number of iterations. The left column contains results after one pass through 
the database, the right column results after 10 random passes. From top to bottom: (top) the 
first 50 patterns which incurred a margin error, (middle) the 50 worst patterns according to 
$f(x) - \rho$ on the training set, (bottom) the 50 worst patterns on an unseen test set. 


Algorithm 10.6 Online SV Learning 

input kernel k, input stream of data X, learning rate Λ, ν, time horizon T 
Initialize "time" t = 0 
repeat 
    t = t + 1 
    Draw new pattern x_t and compute f(x_t) 
    Update α_i ← (1 - Λ) α_i for all i < t 
    Update α_t = Λ H(ρ - f(x_t)) 
    Update ρ ← ρ + Λ(ν - H(ρ - f(x_t))) 
    Truncate the expansion to T terms. 
until no more data arrives 


• We begin with squared loss (3.8) where c is given by $c(x, y, f(x)) = \frac{1}{2}(y - f(x))^2$. 
Consequently the update equation is 

$$(\alpha_i, \alpha_t) \leftarrow ((1 - \lambda\Lambda)\alpha_i,\ \Lambda(y_t - f(x_t))). \tag{10.135}$$

This means that we have to store every observation we make or, more precisely, the 
prediction error we make on every observation. 

• The ε-insensitive loss (see (3.9)) $c(x, y, f(x)) = \max(0, |y - f(x)| - \varepsilon)$ avoids this 
problem but introduces a new parameter, the width of the insensitivity zone $\varepsilon$. 
By making $\varepsilon$ a variable of the optimization problem, as shown in Section 9.3, we 
have 

$$c(x, y, f(x)) = \max(0, |y - f(x)| - \varepsilon) + \nu\varepsilon. \tag{10.136}$$

The update equations now have to be stated in terms of $\alpha_i$, $\alpha_t$, and $\varepsilon$, which is 
allowed to change during the optimization process. This leads to 

$$(\alpha_i, \alpha_t, \varepsilon) \leftarrow \begin{cases} ((1 - \lambda\Lambda)\alpha_i,\ \Lambda\,\mathrm{sgn}(y_t - f(x_t)),\ \varepsilon + (1 - \nu)\Lambda) & \text{if } |y_t - f(x_t)| > \varepsilon \\ ((1 - \lambda\Lambda)\alpha_i,\ 0,\ \varepsilon - \Lambda\nu) & \text{otherwise.} \end{cases} \tag{10.137}$$

This means that every time the prediction error exceeds $\varepsilon$ we increase the 
insensitivity zone by $\Lambda(1 - \nu)$. Likewise, if it is smaller than $\varepsilon$, the insensitivity zone is 
decreased by $\Lambda\nu$. 

• Finally, we analyze the case of regression with Huber's robust loss. The loss (see 
Table 3.1) is given by 

$$c(x, y, f(x)) = \begin{cases} |y - f(x)| - \frac{\sigma}{2} & \text{if } |y - f(x)| \ge \sigma \\ \frac{1}{2\sigma}(y - f(x))^2 & \text{otherwise.} \end{cases} \tag{10.138}$$

As before, we obtain update equations by computing the derivative of c with 
respect to f(x). 

$$(\alpha_i, \alpha_t) \leftarrow \begin{cases} ((1 - \lambda\Lambda)\alpha_i,\ \Lambda\,\mathrm{sgn}(y_t - f(x_t))) & \text{if } |y_t - f(x_t)| > \sigma \\ ((1 - \lambda\Lambda)\alpha_i,\ \Lambda\sigma^{-1}(y_t - f(x_t))) & \text{otherwise.} \end{cases} \tag{10.139}$$

• Comparing (10.139) and (10.137) leads to the question of whether $\sigma$ might not 
also be adjusted adaptively. This is a desirable goal since we may not know the 
amount of noise present in the data. While the $\nu$-setting allows us to form such 
adaptive estimators for batch learning with the ε-insensitive loss, this goal has 
proven elusive for other estimators in the standard batch setting. In the online 
situation, however, such an extension is quite natural (see also [180]). All we need 
do is make $\sigma$ a variable of the optimization problem and set 

$$(\alpha_i, \alpha_t, \sigma) \leftarrow \begin{cases} ((1 - \Lambda)\alpha_i,\ \Lambda\,\mathrm{sgn}(y_t - f(x_t)),\ \sigma + \Lambda(1 - \nu)) & \text{if } |y_t - f(x_t)| > \sigma \\ ((1 - \Lambda)\alpha_i,\ \Lambda\sigma^{-1}(y_t - f(x_t)),\ \sigma - \Lambda\nu) & \text{otherwise.} \end{cases} \tag{10.140}$$
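
As an illustration of the regression updates, the following Python sketch implements the ε-insensitive rule (10.137) with an adaptive insensitivity zone. The function name, the Gaussian kernel, the parameter values, the starting value $\varepsilon = 0$, and the synthetic stream are assumptions for this example; the sketch does not truncate the expansion.

```python
import math


def rbf(x, z, width=0.5):
    """Gaussian RBF kernel on scalars (an assumed choice for this sketch)."""
    return math.exp(-((x - z) ** 2) / (2.0 * width ** 2))


def online_eps_regression(stream, kernel, lam=0.01, rate=0.1, nu=0.3, eps=0.0):
    points, coeffs = [], []
    for x, y in stream:
        fx = sum(a * kernel(p, x) for a, p in zip(coeffs, points))
        err = y - fx
        coeffs = [(1.0 - lam * rate) * a for a in coeffs]  # shrink old coefficients
        if abs(err) > eps:
            points.append(x)
            coeffs.append(rate if err > 0 else -rate)      # rate * sgn(y_t - f(x_t))
            eps += rate * (1.0 - nu)                       # widen the zone
        else:
            eps -= rate * nu                               # narrow the zone
    return points, coeffs, eps


stream = [(0.1 * t, math.sin(0.1 * t)) for t in range(200)]
points, coeffs, eps = online_eps_regression(stream, rbf)
```

Each pattern outside the zone adds one term and widens the zone by `rate * (1 - nu)`; every other pattern narrows it by `rate * nu`, so the final width is determined by how many patterns fell outside.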
The theoretical analysis of such online algorithms is still an area of ongoing re- 
search and we expect significant new results within the next couple of years. For 
first results see [175, 242, 299, 298, 194, 329]. For instance, one may show [298] that 
the estimate obtained by an online algorithm converges to the minimizer of the 
batch setting. Likewise, [242] gives performance guarantees under the assumption 
of bounded RKHS norms. 

For practitioners, however, online algorithms currently offer an alternative to
(sometimes rather tricky) batch settings and extend the domain of application 
available to kernel machines. It will be interesting to see whether the integration 
of Bayesian techniques [546, 128] leads to other novel online methods. 


10.7 Summary 


10.7.1 Topics We Did Not Cover 


While it is impossible to cover all algorithms currently used for Kernel Machines,
we give an (incomplete) list of some other important methods.


Kernel Billiard This algorithm was initially proposed in [450] and subsequently 
modified to accommodate kernel functions in [451]. It works by simulating an 
ergodic dynamical system of a billiard ball bouncing off the boundaries of the 
version space (the version space is the set of all w for which the training data is 
classified correctly). The estimate is then obtained by averaging over the trajecto- 
ries. 


Bayes Point Machine The algorithm [239, 453] is somewhat similar in spirit to the 
Kernel Billiard. The main idea is that, by sampling from the posterior distribution 
of possible estimates (see Chapter 16 for a definition of these quantities), we obtain 
a solution close to the mean of the posterior. 


Iterative Re-weighted Least Squares SVM The main idea is to use clever work- 
ing set selection strategies to identify the subset of SVs that are likely to sit exactly 
on the margin. For those SVs, the separation inequality constraints become equal- 
ities, and, for these, the reduced QP for the working set can be solved via a linear 
system (by quadratically penalizing deviations from the exact equality constraints) 
[407]. This approach can handle additional equality constraints without significant 
extra effort. Accordingly, it has been generalized to situations with further equality 
constraints, such as the $\nu$-SVM [408].


Maximum Margin Perceptron This algorithm works in primal space and relies
on the idea that the weight vector w is a linear combination of vectors
contained in the convex hull of patterns with $y_i = 1$ and the convex hull of patterns
with $y_i = -1$. It can be proven that convergence requires a finite number of steps.
Moreover, the constant threshold b can be determined rather elegantly. See [309]
for a variety of versions of the MMP algorithm.


AdaTron Originally introduced in [12], the kernel version of the AdaTron appeared
in [183]. It is, essentially, an extension of the perceptron algorithm to the
maximum margin case. As we saw in Section 10.6, similar update rules can be
derived with a quadratic soft-margin loss function.


More Mathematical Programming Methods Once one is willing to go beyond the 
standard setting of regularization in a Reproducing Kernel Hilbert Space, one is 
offered a host of further Support-Vector like methods derived from optimization 
theory. The papers of Mangasarian and coworkers [324, 325, 188, 189] present such 
techniques. 


10.7.2 Topics We Covered 


Several algorithms can be used to solve the quadratic programming problem 
arising in SV regression. Most of them can be shown to share a common strategy 
that can be understood well through duality theory. In particular, this viewpoint 
leads to useful optimization and stopping criteria for many different classes of 
algorithm, since the Lagrange multipliers $\alpha_i$ are less interesting quantities than
the value of the objective function itself. 


Interior Point Codes A class of algorithms to exploit these properties explicitly 
are interior point primal-dual path following algorithms (see Section 10.3). They 
are relatively fast and achieve high solution precision in the case of moderately 
sized problems (up to approximately 3000 samples). Moreover, these algorithms 
can be modified easily for general convex loss functions without additional
computational cost. They require computation and inversion of the kernel matrix K,
however, and are thus overly expensive for large problems.


Greedy Approximation We presented a way to extend this method to large scale 
problems which makes use of sparse greedy approximation techniques. The latter 
are particularly well suited to this type of algorithm since they find a low rank 
approximation of the dense and excessively large kernel matrix K directly, without 
ever requiring full computation of the latter. Moreover, we obtain sparse (however 
approximate) solutions, independent of the number of Support Vectors. 


Chunking in its different variants is another modification to make large scale 
problems solvable by classical optimization methods. It requires the breaking up 
of the initial problem into subproblems which are then, in turn, solved separately. 
This is guaranteed to decrease the objective function, thus approaching the global 
optimum. Selection rules, in view of duality theory, are given in Section 10.4.3.


Sequential Minimal Optimization (SMO) is probably one of the easiest algo- 
rithms to implement for SV optimization. It might thus be the method of choice for 
a first attempt to implement an SVM. It exhibits good performance, and proofs of 
convergence have been obtained. Recent research has pointed out several ways of 
improving the basic algorithm. We briefly sketched one technique which improves 
the estimation of the constant threshold b and thus also helps select good subsets
more easily. Pseudocode for regression and classification concludes this section.


Iterative Methods Finally, another class of algorithms can be summarized as
iterative methods, such as gradient descent, the Lagrangian SVM (which is extremely
simple but applicable only for a specific choice of cost function), and online
support vector methods. These have the potential to make the area of large
scale problems accessible to kernel methods and we expect good progress in the 
future. While it is far from clear what the optimal strategy might be, it is our hope 
that the reasoning of Section 10.6.3 will help to propel research in this area. 


10.7.3 Future Developments and Code 


We anticipate that future research will focus on efficient approximate and sparse
solutions. This means that, quite often, it is not necessary to find a kernel expansion
$f = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$, where the $\alpha_i$ are the Lagrange multipliers of the corresponding
optimization problem and, instead, a much more compact function representation
can be used.

Second, we expect that both lightweight optimizers, which can be deployed 
on small consumer hardware, and large-scale optimizers, which take advantage 
of large clusters of workstations, will become available. It is our hope that, in a 
few years, kernel methods will be readily accessible as plug-ins and toolboxes for 
many statistical analysis packages.

Finally, the choice of parameters is still a problem which requires further at- 
tention. While there exist several promising approaches (see [288, 102, 268]) for 
assessing the generalization performance, mainly involving leave-one-out esti- 
mates or their approximation, the problem is far from solved. In particular, every 
new bound on the generalization performance of kernel machines will inevitably 
prompt the need for an improved training algorithm which can take advantage of 
the bound. 

Some readers will miss pointers to readily available code for SVMs in this book.
We deliberately decided not to include such information since it is likely to
become obsolete rather quickly. Instead, we refer the reader to the
kernel-machines website (http://www.kernel-machines.org) for up-to-date information
on the topic.


10.8 Problems 


10.1 (KKT Gap for Linear Programming Regularizers ••)
Compute the explicit functional form of the KKT gap for Linear Programming Regularizers.
Why can't you simply use the expansion coefficients $\alpha_i$ as in Proposition 10.1?


10.2 (KKT Gap for Sub-Optimal Offsets ••)
Compute the functional form of the KKT gap for non-optimal parametric parts in the
expansion of $f$, e.g., if $f(x) = \langle \Phi(x), w \rangle + b$ where $b$ is not optimal. Hint: consider
Theorem 6.22 and prove a variant of Theorem 6.27.


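10.3 (Restarting for $\lambda$ •)
Prove an inequality for $R_{\mathrm{reg}}[f]$ analogous to (10.22) for $\lambda$ rather than $C$, i.e. prove


$$R_{\mathrm{reg}}[f_{\lambda'}, \lambda'] \geq R_{\mathrm{reg}}[f_{\lambda'}, \lambda] \geq R_{\mathrm{reg}}[f_{\lambda}, \lambda] \quad \text{for all } \lambda' > \lambda. \qquad (10.141)$$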


10.4 (Sparse Approximation in the Function Values ••)
State the optimal expansion for approximations of $k(x_i, \cdot)$ by $\tilde{k}_i(\cdot)$ in the space of function
values on $X = \{x_1, \ldots, x_m\}$. How many operations does it cost to compute the expansion?


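10.5 (Rank-1 Updates for Cholesky Decompositions ••)
Given a positive definite matrix $K \in \mathbb{R}^{n \times n}$, its Cholesky decomposition $TT^\top = K$ into
triangular matrices $T \in \mathbb{R}^{n \times n}$, a vector $k \in \mathbb{R}^n$, and a real number $\kappa$, such that the matrix
$\begin{bmatrix} K & k \\ k^\top & \kappa \end{bmatrix}$ is positive definite, show that the Cholesky decomposition of the larger matrix
is given by

$$\begin{bmatrix} K & k \\ k^\top & \kappa \end{bmatrix} = \begin{bmatrix} T & 0 \\ t^\top & \tau \end{bmatrix} \begin{bmatrix} T^\top & t \\ 0 & \tau \end{bmatrix} \qquad (10.142)$$

where

$$t = T^{-1}k \quad \text{and} \quad \tau = \left(\kappa - t^\top t\right)^{1/2}. \qquad (10.143)$$

Why would we replace the equation for $\tau$ by $\tau = \left(\max(0, \kappa - t^\top t)\right)^{1/2}$ for numerical
stability? How can you compute (10.143) in $O(n^2)$ time?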
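
For readers who want to experiment numerically, the update (10.142) and (10.143) can be sketched in plain Python as below. The sketch assumes that $T$ is lower triangular and stored as a list of rows; the function names are hypothetical. Solving $Tt = k$ by forward substitution rather than forming $T^{-1}$ is one way to stay within $O(n^2)$ operations.

```python
import math


def forward_substitution(T, k):
    """Solve T t = k for lower-triangular T (list of rows) in O(n^2) steps."""
    t = []
    for i, row in enumerate(T):
        partial = sum(row[j] * t[j] for j in range(i))
        t.append((k[i] - partial) / row[i])
    return t


def extend_cholesky(T, k, kappa):
    """Return the Cholesky factor of [[K, k], [k^T, kappa]] given T T^T = K."""
    t = forward_substitution(T, k)
    # Clamp at zero so that rounding errors cannot produce a negative radicand.
    tau = math.sqrt(max(0.0, kappa - sum(v * v for v in t)))
    return [row + [0.0] for row in T] + [t + [tau]]
```

For example, with $K = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix}$, $k = (2, 3)^\top$ and $\kappa = 6$, the factor `[[2, 0], [1, 2]]` is extended by the row `[1, 1, 2]`.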


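10.6 (Smaller Memory Footprint for SGMA •••)
Show that rather than caching $K^{mn}$, $(K^{nn})^{-1}$ and $\alpha = (K^{nn})^{-1}K^{nm}$ (and updating the
three matrices accordingly) we can reformulate the sparse greedy matrix approximation
algorithm to use only $T_n$ and $T_n^{-1}K^{nm}$, where $T_n$ is the Cholesky decomposition (see Problem
10.5) of $K^{nn}$ into a product of triangular matrices, that is $T_n T_n^\top = K^{nn}$.

In particular, show that the update for $T_n^{-1}K^{nm}$ is given by

$$T_{n+1}^{-1}K^{(n+1)m} = \begin{bmatrix} T_n^{-1}K^{nm} \\ \tau^{-1}\left(\bar{k}^\top - t^\top T_n^{-1}K^{nm}\right) \end{bmatrix} \qquad (10.144)$$

where $t$ and $\tau$ are defined as in (10.143), and $\bar{k}^\top$ denotes the row of kernel values appended to $K^{nm}$.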


10.7 (Optimality of PCA •••)

Show that for the problem of approximating a positive definite matrix $K$ by a matrix
$\tilde{K}$ of rank $n$ such that both $\tilde{K}$ and $K - \tilde{K}$ are positive definite the solution is given by
projecting onto the largest $n$ principal components of $K$, i.e., by PCA. Here we consider an
approximation to be optimal if the residual trace of $K - \tilde{K}$ is minimized. Show that PCA is
also optimal if we consider the largest eigenvalue of $K - \tilde{K}$ as the quantity to be minimized.
Hint: recall that $K$ is a Gram matrix for some $x_i$, i.e., $K_{ij} = \langle x_i, x_j \rangle$.


10.8 (General Convex Cost Functions [516] ••)
Show that for a convex optimization problem

$$\begin{array}{ll} \text{minimize} & \frac{1}{2} q(\alpha) + \langle c, \alpha \rangle \\ \text{subject to} & A\alpha = d \\ & l \leq \alpha \leq u \end{array} \qquad (10.145)$$

with $c, \alpha, l, u \in \mathbb{R}^m$, $A \in \mathbb{R}^{n \times m}$, and $d \in \mathbb{R}^n$, the inequalities between vectors holding
component-wise, and $q(\alpha)$ being a convex function of $\alpha$, the dual optimization problem
is given by

$$\begin{array}{ll} \text{maximize} & \frac{1}{2}\left(q(\alpha) - \langle \partial_\alpha q(\alpha), \alpha \rangle\right) + \langle d, h \rangle + \langle l, z \rangle - \langle u, s \rangle \\ \text{subject to} & \frac{1}{2}\partial_\alpha q(\alpha) + c - A^\top h + s = z \\ & s, z \geq 0,\ h \text{ free.} \end{array} \qquad (10.146)$$

Moreover, the KKT conditions read

$$g_i z_i = s_i t_i = 0 \quad \text{for all } i \in [m]. \qquad (10.147)$$


10.9 (Interior Point Algorithm for (10.145) [516] ••)

Derive an interior point algorithm for the optimization problem given in (10.145). Hint: 
use a quadratic approximation of q(a) for each iteration and apply an interior point code 
to this modification. Which cost functions does this allow you to use in an SVM? 


10.10 (Sherman-Woodbury-Morrison for Linear SVM [168] ••)
Show that for linear SVMs the cost per interior point iteration is $O(mn^2)$. Hint: use
(10.54) to solve (10.48).


10.11 (KKT Gap and Optimality on Subsets ••)
Prove that after optimization over a subset $S_w$ (and adjusting b in accordance with the subset)
the corresponding contributions to the KKT gap, i.e. the terms $\mathrm{KKT}_i$ for $i \in S_w$, will vanish.


10.12 (SVMTorch Selection Criteria [502, 108] •)
Derive the SVMTorch optimality criteria for SV regression; derive the equations analogous 
to (10.60) for the regression setting. 


10.13 (Gradient Selection and KKT Conditions •)

Show that for regression and classification the patterns selected according to (10.60) are
identical to those chosen by the gradient selection rule, i.e. according to $Q\alpha + c$. Hint:
show that gradient and $\mathrm{KKT}_i$ differ only in the constant offset b, hence taking the maxima
of both sets yields identical results.


10.14 (SMO and Number of Constraints •)

Show that for SMO to make any progress we need at least n + 1 variables where n is
the number of equality constraints in $A\alpha = d$. What does this mean in terms of speed of
optimization?

Find a reformulation of the optimization problem which can do without any equality
constraints. Hint: drop the constant offset b from the expansion. State the explicit solution
to the constrained optimization problem in one variable.

Which selection criteria would you use in this case to find good patterns? Can you adapt 
KKT;, KKT;, and KKT; accordingly? Can you bound the improvement explicitly? 


10.15 (Orthogonality and Clipping •)
Prove Lemma 10.6. Hint: first prove that (10.109) holds for $a, b \in \mathbb{R}$. The lemma then
follows by summation over the coordinates.


10.16 (Lagrangian Support Vector Machines for Regression ••)
Derive a variant of the Lagrangian Support Vector Algorithm for regression. Hint: begin with a
regularized risk functional where b is regularized and the squared $\varepsilon$-insensitive loss function
$c(x, y, f(x)) = \max(|f(x) - y| - \varepsilon, 0)^2$. Next derive an equation analogous to (10.110).

For the iteration scheme you may want to take advantage of orthogonal transformations
such as the one given in (10.52). What is the condition on $\gamma$ in this case?


10.17 (Laplace Approximation ••)

Use Newton's method as described in (6.12) to find an iterative minimization scheme for 
the regularized risk functional. See also Section 16.4.1 for details. For which cost functions 
is it suitable (Hint: Newton’s method is a second order approach)? Can you apply the 
Sherman-Morrison-Woodbury formula to find quick approximate minimizers? Compare 
the algorithm to the Lagrangian Support Vector Machines. 


10.18 (Online Learning and Number of Support Vectors ••)

Show that for a classification problem with nonzero minimal risk the number of Support 
Vectors increases linearly with the number of patterns, provided one chooses a regulariza- 
tion parameter that avoids overfitting. Hint: first show that all misclassified patterns on 
the training set will become Support Vectors, then show that the fraction of misclassified 
patterns is non-vanishing. 


10.19 (Online Learning with $\nu$-SVM •••)
Derive an online version of the $\nu$-SVM classification algorithm. For this purpose begin
with the modified regularized risk functional as given by


$$R_\nu[f] = \frac{1}{m}\sum_{i=1}^{m} c_\rho(x_i, y_i, f(x_i)) - \rho\nu + \frac{1}{2}\|f\|_{\mathcal{H}}^2 \qquad (10.148)$$

Next replace $\frac{1}{m}\sum_{i=1}^{m} c_\rho(x_i, y_i, f(x_i))$ by the stochastic estimate $c_\rho(x_t, y_t, f(x_t))$. Note that
you have to perform updates not only in $f$ but also in the margin $\rho$.

What happens if you change $\rho$ rather than letting $\alpha_i$ decay in the cases where no margin
error occurs? Why don't you need $\Lambda$ any more? Hint: consider the analogous case of
Chapter 9.


11 Incorporating Invariances


Practical experience has shown that in order to obtain the best possible performance,
prior knowledge about the problem at hand ought to be incorporated into
the training procedure. We describe and review methods for incorporating prior 
knowledge on invariances in Support Vector Machines, provide experimental re- 
sults, and discuss their respective merits, gathering material from various sources 
[471, 467, 478, 134, 562]. 

The chapter is organized as follows. The first section introduces the concept 
of prior knowledge, and discusses what types of prior knowledge are used in 
pattern recognition. Following this, we will deal specifically with transformation 
invariances (Section 11.2), discussing two rather different approaches to make 
SVMs invariant: by generating virtual examples from the SVs (Section 11.3) or by 
modifying the kernel function (Section 11.4). Finally, in Section 11.5, we combine 
ideas from both approaches by effectively making the virtual examples part of the 
kernel definition. 

The prerequisites for the chapter are largely limited to basics of the SV classifi- 
cation algorithm (Chapter 7) and some knowledge of linear algebra (Appendix B). 
Section 11.4.2 uses Kernel PCA, as described in Section 1.7 and Chapter 14. 


11.1 Prior Knowledge 


In 1995, LeCun et al. [320] published a pattern recognition performance compari- 
son noting the following: 


“The optimal margin classifier has excellent accuracy, which is most remarkable, 
because unlike the other high performance classifiers, it does not include a priori 
knowledge about the problem. In fact, this classifier would do just as well if the 
image pixels were permuted by a fixed mapping. [...] However, improvements are 
expected as the technique is relatively new.” 


Two things have changed in the five years since this statement was made. First, 
optimal margin classifiers, or Support Vector Machines, as they are now known, 
have become a mainstream method which is part of the standard machine learning 
toolkit. Second, methods for incorporating prior knowledge into optimal margin 
classifiers are now part of the standard SV methodology.

These two developments are actually closely related. Initially, SVMs had been
considered a theoretically elegant spin-off of the general, but apparently largely 
useless, VC-theory of statistical learning. In 1995, using the first methods for 
incorporating prior knowledge, SVMs became competitive with the state of the 
art, in the handwritten digit classification benchmarks that were popularized by 
AT&T Bell Labs [471]. At that point, application engineers who were not interested 
in theory, but in results, could no longer ignore SVMs. In this sense, the methods 
described below actually helped pave the way to make the SVM a widely used 
machine learning tool. 

By prior knowledge we refer to all information about the learning task which is 
available in addition to the training examples. In this most general form, only prior 
knowledge makes it possible to generalize from the training examples to novel test 
examples. 

For instance, many classifiers incorporate general smoothness assumptions about 
the problem. A test pattern which is similar to one of the training examples 
thus tends to be assigned to the same class. For SVMs, using a kernel function 
$k$ amounts to enforcing smoothness with a regularizer $\|\Upsilon f\|^2$, where $f$ is the
estimated function, and $k$ is a Green's function of the regularization operator $\Upsilon$
(Chapter 4). In a Bayesian maximum-a-posteriori setting, this corresponds to a
smoothness prior of $\exp(-\|\Upsilon f\|^2)$ ([295], see Section 16.2.3).

A second method for incorporating prior knowledge, which is somewhat more 
specific, consists of selecting features which are thought to be particularly informa- 
tive or reliable for the task at hand. For instance, in handwritten character recog- 
nition, correlations between image pixels that are nearby tend to be more reliable 
than those between distant pixels. The intuitive reason for this is that variations 
in writing style tend to leave the local structure of a handwritten digit fairly
unchanged, while the global structure is usually quite variable. In the case of SVMs,
this type of prior knowledge is readily incorporated by designing polynomial ker- 
nels which mainly compute products of nearby pixels (Section 13.3). 

One way to look at feature selection is that it changes the representation of 
the data, in which respect it resembles another method for incorporating prior 
knowledge in SVMs that has recently attracted attention. In the latter case, it is 
assumed that we have knowledge about probabilistic models generating the data.
Specifically, let $p(x|\theta)$ be a generative model that characterizes the probability of
a pattern $x$, given the underlying parameter $\theta$. It is possible to construct a class
of kernels which are invariant with respect to reparametrizations of $\theta$, and which,
loosely speaking, have the property that $k(x, x')$ is the similarity of $x$ and $x'$, subject
to the assumption that they both stem from the generative model. These kernels
are called Fisher kernels ([258], cf. Chapter 13). A different approach to designing
kernels based on probabilistic models is presented in [585, 333].

Figure 11.1 Different ways of incorporating invariances in a decision function. The dashed
line marks the “true” boundary, disks and circles are the training examples. We assume that
prior information tells us that the classification function only depends on the norm of the
input vector (the origin being in the center of each picture). Lines through the examples
indicate the type of information conveyed by the different methods for incorporating prior
information. Left: virtual examples are generated in a localized region around each training
example; middle: a regularizer is incorporated to learn tangent values (cf. [497]); right:
the representation of the data is changed by first mapping each example to its norm. If
feasible, the latter method yields the most information. However, if the necessary nonlinear
transformation cannot be found, or if the desired invariances are of a localized nature, we
have to resort to one of the former techniques. Finally, note that examples close to the
boundary allow us to exploit prior knowledge very effectively: given a method to get a first
approximation of the true boundary, the examples closest to the approximate boundary
allow good estimation of the true boundary. A similar two-step approach is pursued in
Section 11.3. (From [471].)

Finally, we get to the type of prior knowledge that we shall start with: prior 
knowledge about invariances. 


11.2 Transformation Invariance 


In many applications of learning procedures, certain transformations of the input 
are known to leave function values unchanged. At least three different ways of 
exploiting this knowledge have been used (illustrated in Figure 11.1): 


1. In the first case, the knowledge is used to generate artificial training examples, 
termed “virtual examples,” [18, 413, 2] by transforming the training examples
accordingly. It is then hoped that given sufficient time, the learning machine will
automatically learn the invariances from the artificially enlarged training data. 


2. In the second case, the learning algorithm itself is modified. This is typically 
done by using a modified error function which forces a learning machine to 
construct a function with the desired invariances [497]. 


3. Finally, in the third case, the invariance is achieved by changing the represen- 
tation of the data by first mapping them into a more suitable space; this approach 
was pursued for instance in [487] and [575]. The data representation can also be 
changed by using a modified distance metric, rather than actually changing the 
patterns [496]. 
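
The third approach can be illustrated with the toy setting of Figure 11.1, in which the class is assumed to depend only on the norm of the input. The following minimal Python sketch (the function names are hypothetical, and planar rotations are chosen here purely for illustration) shows that after mapping each pattern to its norm, rotated copies of a pattern become indistinguishable, so any learning machine trained on this representation is rotation invariant by construction.

```python
import math


def rotate(x, angle):
    """Rotate a point x = (x1, x2) in the plane by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


def norm_feature(x):
    """Change of representation: map a planar pattern to its Euclidean norm."""
    return math.hypot(x[0], x[1])


x = (3.0, 4.0)
for angle in (0.3, 1.234, 2.5):
    # The new representation agrees (up to rounding) for all rotated copies.
    assert abs(norm_feature(rotate(x, angle)) - norm_feature(x)) < 1e-12
```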


Simard et al. [497] compare the first two techniques and find that for the prob- 
lem considered — learning a function with three plateaus, where function values 
are locally invariant — training on the artificially enlarged data set is significantly 
slower, due both to correlations in the artificial data, and the increase in training 
set size. Moving to real-world applications, the latter factor becomes even more 
important. If the size of a training set is multiplied by a number of desired in- 
variances (by generating a corresponding number of artificial examples for each 
training pattern), the resulting training sets can get rather large, such as those used 
in [148]. However, the method of generating virtual examples has the advantage 
of being readily implemented for all kinds of learning machines and symmetries. 
If instead of continuous groups of symmetry transformations we are dealing with 
discrete symmetries, such as the bilateral symmetries of [576], derivative-based 
methods such as those of [497] are not applicable. It is thus desirable to have an 
intermediate method which has the advantages of the virtual examples approach 
without its computational cost. 

The methods described in this chapter try to combine merits of all the ap- 
proaches mentioned above. The Virtual SV method (Section 11.3) retains the flex- 
ibility and simplicity of virtual examples approaches, while cutting down on their 
computational cost significantly. The Invariant Hyperplane method (Section 11.4), 
on the other hand, is comparable to the method of [497] in that it is applicable 
to all differentiable local 1-parameter groups of local invariance transformations, 
comprising a fairly general class of invariances. In addition, it has an equivalent 
interpretation as a preprocessing operation applied to the data before learning. In 
this sense, it can also be viewed as changing the representation of the data to be 
more invariant, in a task-dependent way. Another way to interpret this method 
is as a way to construct kernels that respect local image structures; this will be 
discussed further in a later chapter (Section 13.3). The latter interpretation gives 
rise to a resemblance to the last technique that we discuss, the Jittered SV method 
(Section 11.5), which combines the flexibility of the VSV method with the elegance 
of an approach that directly modifies the kernel and does not need to enlarge the 
training set.


11.3 The Virtual SV Method


In Section 7.8.2, it has been argued that the SV set contains all information necessary
to solve a given classification task. In particular, it is possible to train any
one of three different types of SVMs solely on the Support Vector set extracted 
by another machine, with a test performance no worse than after training on the 
full database. Using this finding as a starting point, we now investigate whether 
it might be sufficient to generate virtual examples from the Support Vectors only. 
After all, we might hope that it does not add much information to generate virtual 
examples from patterns which are not close to the boundary. In high-dimensional 
cases, however, care has to be exercised regarding the validity of this intuitive 
picture. Thus, an experimental test on a high-dimensional real-world problem is 
imperative. In our experiments, we proceeded as follows (cf. Figure 11.2): 


1. Train a Support Vector Machine to extract the Support Vector set. 


2. Generate artificial examples by applying the desired invariance transforma- 
tions to the Support Vectors. In the following, we will refer to these examples as 
Virtual Support Vectors (VSVs). 


3. Train another Support Vector Machine on the examples generated. 
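
Step 2 can be sketched in Python for the case of one-pixel image translations in the four principal directions. This is only an illustration under stated assumptions: images are represented as lists of rows, pixels shifted in from outside the image are filled with a background value of 0, and the unchanged Support Vectors are kept alongside their shifted copies. The function names are hypothetical, and the SVM training in steps 1 and 3 is left to whatever trainer is at hand.

```python
def shift(image, dx, dy, fill=0.0):
    """Translate a 2D image (list of rows) by dx columns and dy rows."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            rr, cc = r + dy, c + dx
            if 0 <= rr < height and 0 <= cc < width:
                out[rr][cc] = image[r][c]
    return out


# Identity plus one-pixel shifts: right, left, down, up.
SHIFTS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def virtual_support_vectors(support_vectors, labels):
    """Return shifted copies of each Support Vector with unchanged labels."""
    images, out_labels = [], []
    for image, label in zip(support_vectors, labels):
        for dx, dy in SHIFTS:
            images.append(shift(image, dx, dy))
            out_labels.append(label)
    return images, out_labels
```

With these five transformations, the training set for the second machine has five times the size of the Support Vector set extracted in step 1.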


Clearly, the scheme can be iterated; care must be exercised, however, since the 
iteration of local invariances can lead to global invariances which are not always 
desirable; consider the example of a '6' rotating into a '9' [496].

If the desired invariances are incorporated, the curves obtained by applying 
Lie group transformations to points on the decision surface should have tangents 
parallel to the latter (cf. [497]).¹ If we use small Lie group transformations (e.g.,
translations) to generate the virtual examples, this implies that the Virtual Support 
Vectors should be approximately as close to the decision surface as the original 
Support Vectors. Hence, they are fairly likely to become Support Vectors after the 
second training run. Vice versa, if a substantial fraction of the Virtual Support 
Vectors turn out to become Support Vectors in the second run, we have reason to 
expect that the decision surface does have the desired shape. 

Let us now look at some experiments validating the above intuition. The first 
set of experiments was conducted on the USPS database of handwritten digits 
(Section A.1). This database has been used extensively in the literature, with a 
LeNet1 Convolutional Network achieving a test error rate of 5.0% [318]. As in
Section 7.8.1, we used C = 10.


1. The reader who is not familiar with the concept of a Lie group may think of it as a
group of transformations where each element is labelled by a set of continuously variable
parameters. Such a group may be considered also to be a manifold, where the parameters
are the coordinates. For Lie groups, it is required that all group operations are smooth maps.
It follows that we can, for instance, compute derivatives with respect to the parameters.
Examples of Lie groups are the translation group, the rotation group, and the Lorentz
group; further details can be found in textbooks on differential geometry, e.g., [120].


Figure 11.2 Suppose we have prior knowledge indicating that the decision function
should be invariant with respect to horizontal translations. The true decision boundary
is drawn as a horizontal line (top left); as we are just given a limited training sample,
however, different separating hyperplanes are conceivable (top right). The SV algorithm finds
the unique separating hyperplane with maximal margin (bottom left), which in this case is
quite different from the true boundary. For instance, it leads to incorrect classification of the
ambiguous point indicated by the question mark. Making use of the prior knowledge by
generating Virtual Support Vectors from the Support Vectors found in a first training run,
and retraining on these, yields a more accurate decision boundary (bottom right).
Furthermore, note that for the example considered, it is sufficient to train the SVM only on virtual
examples generated from the Support Vectors.

Table 11.1 shows that incorporating only translational invariance already im- 
proves performance significantly, from an error rate of 4.0% to 3.2%.” For other 
types of invariances (Figure 11.3), we also found improvements, albeit smaller 
ones: generating Virtual Support Vectors by rotations or by line thickness trans- 
formations,’ we constructed polynomial classifiers with a 3.7% error rate (in both 


2. For anumber of reference results, cf. Table 7.4. 
3. Briefly, the idea of the line thickness transformation of an image is to add some multiple 
of its gradient. As the original outline of the image is the area of highest gradient, this 
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Figure 11.3 Different invariance transformations in the case of handwritten digit recogni- 
tion (MNIST database). In all three cases, the central pattern is the original which is trans- 
formed into “virtual examples” (marked by gray frames) with the same class membership, 
by applying small transformations. 


cases). For details, see [467]. 

Note, moreover, that generating Virtual examples from the full database rather 
than just from the SV sets did not improve the accuracy, nor did it substantially 
enlarge the SV set of the final classifier. This finding was reproduced for a Virtual 
SV system with a Gaussian RBF kernel [482]: in that case, as in Table 11.1, generat- 
ing Virtual examples from the full database led to identical performance, and only 
slightly increased the SV set size (861 instead of 806). We conclude that for this 
recognition task, it is sufficient to generate Virtual examples only from the SVs 
— Virtual examples generated from the other patterns do not add much useful 
information. 

The larger a database, the more information about invariances of the decision 
function is already contained in the differences between patterns of the same class. 
To show that it is nonetheless possible to improve classification accuracy using 
our technique, we applied the method to the MNIST database (Section A.1) of 
60000 handwritten digits. This database has become the standard for performance 
comparisons at AT&T Bell Labs; the error rate record of 0.7% was held until 
recently by a boosted LeNet4 [64, 321], which represents an ensemble of learning 
machines; the best single machine performance was achieved at this time by a 
LeNet5 convolutional neural network (0.9%). Other high performance systems 
include a Tangent Distance nearest neighbor classifier (1.1%). 


procedure tends to make lines thicker [148]. 
Table 11.1 Comparison of Support Vector sets and performance for training on the original
database and training on the generated Virtual Support Vectors, for the USPS database 
of handwritten digits. In both training runs, we used a polynomial classifier of degree 3. 
Virtual Support Vectors were generated by simply shifting the images by one pixel in the 
four principal directions. Adding the unchanged Support Vectors, this leads to a training set 
for the second classifier which has five times the size of the first classifier’s overall Support 
Vector set (i.e., the union of the 10 Support Vector sets from the binary classifiers, of size 1677 
— note that due to some overlap, this is smaller than the sum of the ten Support Vector set 
sizes). Note that training on virtual patterns generated from all training examples does not 
lead to better results than in the Virtual SV case; moreover, although the training set in this 
case is much larger, it barely leads to more SVs. 


classifier trained on         | size  | av. no. of SVs | test error
full training set             | 7291  | 274            | 4.0%
overall SV set                | 1677  | [illegible]    | [illegible]
Virtual SV set                | 8385  | [illegible]    | 3.2%
Virtual patterns from full DB | 36255 | 719            | [illegible]


Using Virtual Support Vectors generated by 1-pixel translations into the four 
principal directions, we improved a degree 5 polynomial SV classifier from an 
error rate of 1.4% to 1.0% on the 10000 element test set. We applied our technique 
separately for all ten Support Vector sets of the binary classifiers (rather than 
for their union) in order to avoid having to deal with large training sets in the 
retraining stage. In addition, note that for the MNIST database, we did not attempt 
to generate Virtual examples from the whole database, as this would have required 
training on a very large training set.⁴
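The generation step described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the code used in the experiments: images are plain lists of rows, pixels shifted outside the frame are dropped, and the names `shift_image` and `virtual_support_vectors` are hypothetical.

```python
def shift_image(image, dx, dy, fill=0.0):
    """Translate a 2-D image (list of rows) by (dx, dy) pixels, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < height and 0 <= src_c < width:
                shifted[r][c] = image[src_r][src_c]
    return shifted


# One-pixel shifts in the four principal directions (right, left, down, up).
PRINCIPAL_SHIFTS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def virtual_support_vectors(support_vectors, labels, shifts=PRINCIPAL_SHIFTS):
    """Return each SV plus its shifted copies; every copy keeps the label of its SV."""
    examples, targets = [], []
    for image, label in zip(support_vectors, labels):
        examples.append(image)
        targets.append(label)
        for dx, dy in shifts:
            examples.append(shift_image(image, dx, dy))
            targets.append(label)
    return examples, targets


# Toy 3x3 "Support Vector" with a single ink pixel in the centre.
svs = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
virtual_x, virtual_y = virtual_support_vectors(svs, [+1])
```

The retraining stage then runs the standard SV algorithm on `virtual_x`, `virtual_y`; with four shifts plus the unchanged SV, the set is five times the size of the SV set.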

After retraining, the number of SVs more than doubles [467]. Thus, although the
training sets for the second set of binary classifiers are substantially smaller than
the original database (for four Virtual SVs per SV, four times the size of the original
SV sets, in our case amounting to around $10^4$), we conclude that the amount of
data in the region of interest, close to the decision boundary, more than doubles.
Therefore, it should be possible to use a more complex decision function in the 
second stage (note that the typical risk bounds (Chapter 5) depend on the ratio of 
VC-dimension and training set size). Indeed, using a degree 9 polynomial leads to 
an error rate of 0.8%, very close to the boosted LeNet4 (0.7%). 

Recently, several systems have been designed which are on par with or better 
than the boosted LeNet4 [536, 134, 30]. The new record is now held by a virtual 
SV classifier which used more virtual examples, leading to results which are 


4. We did, however, compute such a solution for the small MNIST database (Section A.1). 
In this case, a degree 5 polynomial classifier was improved from an error of 3.8% to 2.5% 
using the Virtual SV method, with an increase of the average SV set sizes from 324 to 823. By 
generating Virtual examples from the full training set, and retraining on these, we obtained 
a system which had slightly more SVs (939), but an unchanged error rate. 
Table 11.2 Summary of error rates on the MNIST handwritten digit set, using the 10,000
element test set (cf. Section A.1). At 0.6% (0.56% before rounding), the VSV system using 12 
translated virtual examples per SV (described in the text) performs best. 


Classifier                      | test error | reference
LeNet5                          | 0.8%       | [319]
Boosted LeNet4                  | 0.7%       | [64, 319]
Virtual SVM with 8 VSVs per SV  | 0.7%       | present chapter, [134]
Virtual SVM with 12 VSVs per SV | 0.6%       | present chapter, [134]


actually superior to the boosted LeNet4. Since this dataset is considered the “gold 
standard” of classifier benchmarks, it is worth reporting some details of this study. 
Table 11.3 summarizes the results, giving the lowest published test error for this 
data set (0.56%, [134]). Figure 11.4 shows the 56 misclassified test examples. For 
reference purposes, Table 11.2 gives a summary of error rates on the MNIST set. 

As above, a polynomial kernel of degree 9 was used. Patterns were deslanted
and normalized so that dot products within [0, 1] yielded kernel values within
[0, 1]; specifically,

$$k(x, x') = \frac{1}{512} \left( \langle x, x' \rangle + 1 \right)^9. \tag{11.1}$$

This ensures that the kernel value 1 has the same meaning that holds for other
kernels, such as radial basis function (RBF) kernels. Namely, a kernel value of 1
corresponds to the minimum distance (between identical examples) in the feature
space. It was ensured that any dot product was within [0, 1] by normalizing each
example by its 2-norm (such that the dot product of each example
with itself gave a value of 1).

The value of the regularization constant $C$ (the upper bound on the $\alpha_i$) was
determined as follows. By trying a large value when training a binary recognizer
for digit “8”, it was determined that no training example reached an $\alpha_i$ value above
7, and only a handful of examples in each of the 10 digit classes had $\alpha_i$ values
above 2. Under the assumption that only a few training examples in each class
are particularly noisy and that digit “8” is one of the harder digits to recognize,
Figure 11.4 The 56 test errors of the record VSV system for MNIST. The number in the
lower left corner of each image box indicates the test example number (1 through 10000). 
The first number in the upper right corner represents the predicted digit label, and the 
second denotes the true label (from [134]). 


we chose C = 2. This simplistic choice is likely suboptimal, and could possibly be 
improved by (time-consuming) experiments on validation sets. 
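The normalization behind (11.1) is easy to check numerically. The following sketch assumes non-negative pixel vectors stored as Python lists; the function names are ours and purely illustrative.

```python
import math


def normalize(x):
    """Scale a vector to unit 2-norm, so that <x, x> = 1."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def poly9_kernel(x, y):
    """Normalized degree-9 polynomial kernel, cf. (11.1): (<x, y> + 1)^9 / 512."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** 9 / 512.0


a = normalize([3.0, 4.0])      # non-negative "pixel" values
b = normalize([1.0, 0.0])
k_self = poly9_kernel(a, a)    # identical examples: dot product 1, kernel value 1
k_cross = poly9_kernel(a, b)   # dot product 0.6, kernel value between 0 and 1
```

Since $2^9 = 512$, a dot product of 1 maps to a kernel value of exactly 1, and any dot product in [0, 1] maps into [1/512, 1].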

The experiments employed new SMO-based methods (see Chapter 10 for a 
description of SMO), as described in [134], including a technique called digestion, 
which switches SMO from “full” to “inbounds” iteration every time the working 
set of SVs grows by a large amount. The resulting faster training times can be put 
to good use by trying the VSV method with more invariance than was practical in 
previous experiments. In our case, it was possible to generate VSVs by translation 
of each SV by a distance of one pixel in any of the 8 directions (horizontal, vertical, 
or both), plus two-pixel translations horizontally and vertically. Total training 
time, for obtaining 10 binary recognizers for each of the base SV and the VSV 
stages, was about 2 days (50 hours) on a Sun Ultra60 workstation with a 450MHz 
processor and 2 Gigabytes of RAM (allowing an 800Mb kernel cache). The VSV 
stage was more expensive than the base SV stage (averaging about 4 hours versus
1 hour) — a majority of examples used in VSV training typically ended up being 
Support Vectors. 
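To make the count of 12 virtual examples per SV explicit, the translations described above can be enumerated as pixel offsets. This is only an illustrative sketch; the helper name `vsv_offsets` is hypothetical.

```python
def vsv_offsets():
    """Offsets (dx, dy): 8 one-pixel shifts plus 4 two-pixel axis-aligned shifts."""
    one_pixel = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                 if (dx, dy) != (0, 0)]
    two_pixel = [(2, 0), (-2, 0), (0, 2), (0, -2)]
    return one_pixel + two_pixel


offsets = vsv_offsets()
```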
Table 11.3 Results for the MNIST benchmark (10000 test examples), using an
inhomogeneous polynomial kernel of degree 9 and deslanted data. The VSV system was trained using
12 virtual examples per SV. First table: test error rates of the multi-class systems; second table: 
error rates of the individual binary recognizers that were trained for each digit class, third 
table: numbers of SVs (from [134]). 
We conclude this section by noting that the same technique has also been
applied in the domains of face classification and object recognition. In these areas,
where we are dealing with bilaterally symmetric objects, useful virtual SVs can 
also be generated by reflection with respect to a vertical axis. This kind of discrete 
transformation does not arise from a continuous group of transformations, but this 
does not present a problem with the VSV approach [467]. 
In this section, we describe a self-consistency argument for obtaining invariant
SVMs. Interestingly, it will turn out that the criterion we end up with can be 
viewed as a meaningful modification of the kernel function. 
11.4.1 Invariance in Input Space


We need a self-consistency argument because we face the following problem: to
express the condition of invariance of the decision function, we already need to
know its coefficients $\alpha_i$, which are found only during the optimization; this in
turn should already take into account the desired invariances. As a way out of
this circle, we use the following ansatz: consider the linear decision functions
$f = \mathrm{sgn} \circ g$,⁵ where $g$ is defined as

$$g(x_j) = \sum_{i=1}^{m} \alpha_i y_i \langle B x_j, B x_i \rangle + b, \tag{11.2}$$

with a matrix $B$ to be determined below. This follows a suggestion of [561], using
the conjecture that we could incorporate invariances by modifying the dot product
used. Any nonsingular $B$ defines a dot product, which can equivalently be written
in the form $\langle x_j, A x_i \rangle$, with a positive definite matrix $A = B^\top B$.

Clearly, if $g$ is invariant under a class of transformations of the $x_j$, then the same
holds true for $f = \mathrm{sgn} \circ g$, which is what we are aiming for. Strictly speaking,
however, invariance of $g$ is not necessary at points which are not Support Vectors,
since these lie in a region where $\mathrm{sgn} \circ g$ is constant.

The above notion of invariance refers to invariance when evaluating the
decision function. A different notion could relate to whether the separating
hyperplane, including its margin, changes if the training examples are transformed. It
turns out that when discussing the invariance of $g$ rather than $f$, these two
concepts are closely related. In the following argument, we restrict ourselves to the
separable case ($\xi_i = 0$ for all $i = 1, \ldots, m$). As the separating hyperplane and its
margin are expressed in terms of Support Vectors, locally transforming a Support
Vector $x_i$ changes the hyperplane or the margin if $g(x_i)$ changes: if $|g|$ gets smaller
than 1, the transformed pattern lies in the margin, and the recomputed margin
is smaller; if $|g|$ gets larger than 1, the margin might become larger, depending
on whether the pattern can be expressed in terms of the other SVs (cf. the remarks
preceding Proposition 7.4). In terms of the mechanical analogy of Section 7.3:
moving Support Vectors changes the mechanical equilibrium for the sheet separating
the classes. Conversely, a local transformation of a non-Support Vector $x_i$ never
changes $f$, even if the value $g(x_i)$ changes, as the solution of the problem is
expressed in terms of the Support Vectors only.

In this sense, invariance of f under local transformations of the given data 
corresponds to invariance of (11.2) for the Support Vectors. Note, however, that 
this criterion is not readily applicable: before finding the Support Vectors in the 
optimization, we already need to know how to enforce invariance. Thus the above
observation cannot be used directly; however, it could serve as a starting point for


5. As usual, sgn o g denotes composition of the functions sgn and g; in other words, the 
application of sgn to the result of g. 
constructing heuristics or iterative solutions. In the Virtual SV method described
in Section 11.3, a first run of the standard SV algorithm is carried out to obtain an 
initial SV set; similar heuristics could be applied in the present case. 

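Local invariance of $g$ for each pattern $x_j$ under transformations of a
differentiable local 1-parameter Lie group of local transformations $\mathcal{L}_t$,

$$\frac{\partial}{\partial t}\Big|_{t=0} g(\mathcal{L}_t x_j) = 0, \tag{11.3}$$

can be approximately enforced by minimizing the regularizer

$$\frac{1}{m} \sum_{j=1}^{m} \left( \frac{\partial}{\partial t}\Big|_{t=0} g(\mathcal{L}_t x_j) \right)^2. \tag{11.4}$$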

Note that the sum may run over labelled as well as unlabelled data, so in principle 
we could also require the decision function to be invariant with respect to trans- 
formations of elements of a test set, without looking at the test labels. In addition, 
we could use different transformations for different patterns. 

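For (11.2), the local invariance term (11.3) becomes

$$\begin{aligned}
\frac{\partial}{\partial t}\Big|_{t=0} g(\mathcal{L}_t x_j)
&= \frac{\partial}{\partial t}\Big|_{t=0} \left( \sum_{i=1}^{m} \alpha_i y_i \langle B \mathcal{L}_t x_j, B x_i \rangle + b \right) \\
&= \sum_{i=1}^{m} \alpha_i y_i \frac{\partial}{\partial t}\Big|_{t=0} \langle B \mathcal{L}_t x_j, B x_i \rangle \\
&= \sum_{i=1}^{m} \alpha_i y_i \, \partial_1 \langle B \mathcal{L}_0 x_j, B x_i \rangle \cdot B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j,
\end{aligned} \tag{11.5}$$

using the chain rule. Here, $\partial_1 \langle B \mathcal{L}_0 x_j, B x_i \rangle$ denotes the gradient of $\langle x, x' \rangle$ with
respect to $x$, evaluated at the point $(x, x') = (B \mathcal{L}_0 x_j, B x_i)$.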

As an aside, note that a sufficient, albeit rather strict, condition for invariance is
thus that $\frac{\partial}{\partial t}\big|_{t=0} \langle B \mathcal{L}_t x_j, B x_i \rangle$ vanish for all $i, j$ [86]; we will proceed in our
derivation, however, with the goal of imposing weaker conditions which apply for one
specific decision function, rather than simultaneously for all decision functions
expressible through different choices of the coefficients $\alpha_i y_i$.

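Substituting (11.5) into (11.4), and using the relations $\mathcal{L}_0 = \mathbf{1}$ (the identity) and
$\partial_1 \langle x, x' \rangle = {x'}^\top$, yields the regularizer

$$\begin{aligned}
&\frac{1}{m} \sum_{j=1}^{m} \left( \sum_{i=1}^{m} \alpha_i y_i (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^2 \\
&= \frac{1}{m} \sum_{j=1}^{m} \left( \sum_{i=1}^{m} \alpha_i y_i (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right) \left( \sum_{k=1}^{m} \alpha_k y_k \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^\top B^\top (B x_k) \right) \\
&= \sum_{i,k=1}^{m} \alpha_i y_i \alpha_k y_k \langle B x_i, B T B^\top B x_k \rangle,
\end{aligned} \tag{11.6}$$

where

$$T := \frac{1}{m} \sum_{j=1}^{m} \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right) \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^\top. \tag{11.7}$$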

We now choose $B$ such that (11.6) reduces to the standard SV target function (7.10),
in the form obtained following the substitution of (7.15) (cf. the quadratic term of
(7.17)), where we utilize the dot product chosen in (11.2), so that

$$\langle B x_i, B T B^\top B x_k \rangle = \langle B x_i, B x_k \rangle. \tag{11.8}$$

A sufficient condition for this to hold is

$$B^\top B T B^\top B = B^\top B, \tag{11.9}$$

or, by requiring $B$ to be nonsingular (meaning that no information is lost during
the preprocessing), $B T B^\top = \mathbf{1}$. This can be satisfied by a preprocessing matrix

$$B = T^{-\frac{1}{2}}, \tag{11.10}$$

the positive definite square root of the inverse of the positive definite matrix
$T$ defined in (11.7). We have thus transformed the standard SV programming
problem into one which uses an invariance regularizer instead of the standard
maximum margin regularizer, simply by choosing a different dot product, or
equivalently, by making use of a linear preprocessing operation.

In practice, we usually want something in between. In other words, we want
some invariance, but still a reasonably large margin. To this end, we use a matrix

$$T_\lambda := (1 - \lambda) T + \lambda \mathbf{1}, \tag{11.11}$$

with $0 < \lambda \le 1$, instead of $T$. As $T$ is positive definite, $T_\lambda$ is strictly positive definite,
and thus invertible. For $\lambda = 1$, we recover the standard SV optimal hyperplane
algorithm; other values of $\lambda$ determine the trade-off between invariance and model
complexity control.
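As a small numerical illustration of (11.7) and (11.11), the sketch below estimates tangent vectors by finite differences and forms the tangent covariance matrix and its blend with the identity. The rotation used as the transformation, the step size `t`, and all function names are our own assumptions for the example.

```python
import math


def tangent_vectors(patterns, transform, t=1e-3):
    """Finite-difference estimate of d/dt L_t x at t = 0 for each pattern."""
    return [[(a - b) / t for a, b in zip(transform(x, t), x)] for x in patterns]


def tangent_covariance(tangents):
    """T = (1/m) * sum_j t_j t_j^T, cf. (11.7)."""
    m, n = len(tangents), len(tangents[0])
    return [[sum(tv[r] * tv[c] for tv in tangents) / m for c in range(n)]
            for r in range(n)]


def blend_with_identity(T, lam):
    """T_lambda = (1 - lam) * T + lam * 1, cf. (11.11)."""
    n = len(T)
    return [[(1.0 - lam) * T[r][c] + (lam if r == c else 0.0) for c in range(n)]
            for r in range(n)]


def rotate(x, t):
    """Hypothetical example transformation L_t: rotate a 2-D point by angle t."""
    return [math.cos(t) * x[0] - math.sin(t) * x[1],
            math.sin(t) * x[0] + math.cos(t) * x[1]]


patterns = [[1.0, 0.0], [0.0, 2.0]]
T = tangent_covariance(tangent_vectors(patterns, rotate))
T_lam = blend_with_identity(T, lam=0.4)
```

For these two patterns the exact tangent vectors are (0, 1) and (-2, 0), so $T$ is close to diag(2, 0.5) and $T_\lambda$ to diag(1.6, 0.7).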

By choosing the preprocessing matrix $B$ according to (11.10), we obtain a
formulation of the problem in which the standard SV quadratic optimization technique
minimizes the tangent regularizer (11.4): the maximum of (7.17) subject to (7.18)
and (7.19), using the modified dot product as in (11.2), coincides with the
minimum of (11.4) subject to the separation conditions $y_j \cdot g(x_j) \ge 1$, where $g$ is defined
as in (11.2).

Note that preprocessing with $B$ does not affect classification speed: since
$\langle B x_j, B x_i \rangle = \langle x_j, B^\top B x_i \rangle$, we can precompute $B^\top B x_i$ for all SVs $x_i$, and thus
obtain a machine (with modified SVs) which is as fast as a standard SVM.
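The speed argument can be verified directly. In this sketch, the matrix `B`, the Support Vectors, and the test point are arbitrary made-up values; the point is only that the precomputed vectors $B^\top B x_i$ give the same dot products as preprocessing both arguments.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


B = [[2.0, 0.0], [1.0, 0.5]]                 # hypothetical nonsingular matrix
support_vectors = [[1.0, 2.0], [-1.0, 0.5]]  # hypothetical SVs

# Precompute the modified SVs B^T B x_i once, after training.
modified_svs = [matvec(transpose(B), matvec(B, sv)) for sv in support_vectors]

x = [0.3, -0.7]
slow = [dot(matvec(B, x), matvec(B, sv)) for sv in support_vectors]  # <Bx, Bx_i>
fast = [dot(x, z) for z in modified_svs]                             # <x, B^T B x_i>
```

At test time only the `fast` form is needed, so no preprocessing of the test pattern is required.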

Let us now provide some interpretation of (11.10) and (11.7). The tangent vectors
$\pm \frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x_j$ have zero mean, thus $T$ is a sample estimate of the covariance matrix
of the random vector $\pm \frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x$. Based on this observation, we call $T$ the Tangent
Covariance Matrix of the data set $\{x_i \mid i = 1, \ldots, m\}$ with respect to the
transformations $\mathcal{L}_t$.
Being strictly positive definite,⁶ $T$ can be diagonalized as $T = U D U^\top$, where the
columns of the unitary matrix $U$ are the eigenvectors of $T$, and the diagonal matrix
$D$ contains the corresponding positive eigenvalues. Then we can compute

$$B = T^{-\frac{1}{2}} = U D^{-\frac{1}{2}} U^\top, \tag{11.12}$$

where $D^{-\frac{1}{2}}$ is the diagonal matrix obtained from $D$ by taking the inverse square
roots of the diagonal elements. Since the dot product is unitarily invariant (see
Section B.2.2), we may drop the leading $U$, and (11.2) becomes

$$g(x_j) = \sum_{i=1}^{m} \alpha_i y_i \langle D^{-\frac{1}{2}} U^\top x_j, D^{-\frac{1}{2}} U^\top x_i \rangle + b. \tag{11.13}$$

A given pattern is thus first transformed by projecting it onto the eigenvectors of
the tangent covariance matrix $T$, which are the rows of $U^\top$. The resulting feature
vector is then rescaled by dividing by the square roots of the eigenvalues of $T$.⁷
In other words, the directions of main variance of the random vector $\frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x$ are
scaled back, thus more emphasis is put on features which are less variant under
$\mathcal{L}_t$. This can be thought of as a whitening operation.
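For intuition, the whitening matrix of (11.12) can be worked out by hand in two dimensions. The sketch below is restricted to symmetric, strictly positive definite 2x2 matrices, for which the diagonalizing rotation has a closed form; the example matrix and the function names are our own assumptions.

```python
import math


def inverse_sqrt_2x2(T):
    """B = U D^(-1/2) U^T for a symmetric, strictly positive definite 2x2 matrix T."""
    a, b, c = T[0][0], T[0][1], T[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)    # rotation angle that diagonalizes T
    cs, sn = math.cos(theta), math.sin(theta)
    U = [[cs, -sn], [sn, cs]]                   # columns are the eigenvectors of T
    d1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn   # eigenvalue of column 1
    d2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs   # eigenvalue of column 2
    s1, s2 = 1.0 / math.sqrt(d1), 1.0 / math.sqrt(d2)
    return [[U[r][0] * s1 * U[k][0] + U[r][1] * s2 * U[k][1] for k in range(2)]
            for r in range(2)]


def matmul(A, C):
    return [[sum(A[r][k] * C[k][c] for k in range(len(C)))
             for c in range(len(C[0]))] for r in range(len(A))]


T = [[2.0, 0.6], [0.6, 1.0]]          # hypothetical tangent covariance matrix
B = inverse_sqrt_2x2(T)
B_transposed = [list(col) for col in zip(*B)]
whitened = matmul(matmul(B, T), B_transposed)   # B T B^T, should be the identity
```

In higher dimensions one would use a numerical symmetric eigensolver in place of the closed-form rotation.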

For example, in image analysis, if the $\mathcal{L}_t$ represent translations, more emphasis
is put on the relative proportions of ink in the image rather than the positions of
lines. The PCA interpretation of our preprocessing matrix suggests the possibility
of regularizing and reducing dimensionality by discarding some of the features,
as is common when doing PCA. As an aside, note that the resulting matrix will
still satisfy (11.9).⁸

Combining the PCA interpretation with the considerations following (11.2)
leads to an interesting observation: the tangent covariance matrix could be
rendered a task-dependent covariance matrix by computing it entirely from the SVs,
rather than from the full data set. Although the summation in (11.7) does not take
into account class labels $y_i$, it then implicitly depends on the task to be solved, via
the SV set. Thus, it allows the extraction of features which are invariant in a task- 


6. It is understood that we use $T_\lambda$ if $T$ is not strictly positive definite (cf. (11.11)) (for the
concept of strict positive definiteness, cf. Definition 2.4 and the remarks thereafter).

7. As an aside, note that our goal to build invariant SVMs has thus serendipitously pro- 
vided us with an approach for another intriguing problem, namely that of scaling: in SVMs, 
there is no obvious way of automatically assigning different weight to different directions 
in input space (see [102]) — in a trained SVM, the weights of the first layer (the SVs) form a 
subset of the training set. Choosing these Support Vectors from the training set only gives 
rather limited possibilities for appropriately dealing with different scales in different direc- 
tions of input space. 

8. To see this, first note that if $B$ solves $B^\top B T B^\top B = B^\top B$, and the polar decomposition
of $B$ is $B = U B_s$, with $U U^\top = \mathbf{1}$ and $B_s = B_s^\top$, then $B_s$ also solves this expression. Thus,
we may restrict ourselves to symmetrical solutions. For our choice $B = T^{-\frac{1}{2}}$, $B$ commutes
with $T$, hence they can be diagonalized simultaneously. In this case, $B^2 T B^2 = B^2$ can also
be satisfied by any matrix which is obtained from $B$ by setting an arbitrary selection of
eigenvalues to 0 (in the diagonal representation).
dependent way: it does not matter whether features for “easy” patterns change
with transformations; it is more important that the “hard” patterns, close to the 
decision boundary, lead to invariant features. 


11.4.2 Invariance in Feature Space 


Let us now move on to the nonlinear case. We now enforce invariance in a slightly 
less direct way, by requiring that the action of the invariance transformations move 
the patterns parallel to the separating hyperplane; in other words, orthogonal to 
the weight vector normal to the hyperplane [467, 562]. This approach may seem 
specific to hyperplanes; similar methods can be used for kernel Fisher discriminant 
analysis (Chapter 15), however. In the latter case, invariances are enforced by 
performing oriented PCA [364, 140]. 

Let us now modify the analysis of the SV classification algorithm as described
in Chapter 7. There, we had to minimize (7.10) subject to (7.11). When we want
to construct invariant hyperplanes, the situation is slightly different. We do not
only want to separate the training data, but we want to separate it in such a way
that submitting a pattern to a transformation of an a priori specified Lie group
(with elements $\mathcal{L}_t$, $t \in \mathbb{R}$) does not alter its class assignment. This can be achieved
by enforcing that the classification boundary be such that group actions move
patterns parallel to the decision boundary, rather than across it. A local statement
of this property is the requirement that the Lie derivatives $\frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x_i$ be orthogonal
to the normal vector $w$ which determines the separating hyperplane in feature
space. Thus we modify (7.10) by adding a second term to enforce invariance:

$$\tau(w) = \frac{1}{2} \left( (1 - \lambda) \frac{1}{m} \sum_{i=1}^{m} \left\langle w, \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i \right\rangle^2 + \lambda \|w\|^2 \right). \tag{11.14}$$

For $\lambda = 1$, we recover the original objective function; for values $1 > \lambda > 0$, different
amounts of importance are assigned to invariance with respect to the Lie group of
transformations $\mathcal{L}_t$.
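A toy evaluation of (11.14) shows how the extra term penalizes weight vectors that are not orthogonal to the tangent directions. The tangent vectors and weight vectors below are made-up values, and `invariant_objective` is a hypothetical helper name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def invariant_objective(w, tangents, lam):
    """tau(w) = 0.5 * ((1 - lam) * mean_i <w, t_i>^2 + lam * ||w||^2), cf. (11.14)."""
    invariance = sum(dot(w, t) ** 2 for t in tangents) / len(tangents)
    return 0.5 * ((1.0 - lam) * invariance + lam * dot(w, w))


tangents = [[1.0, 0.0], [2.0, 0.0]]   # hypothetical tangent vectors along axis 0
w_parallel = [1.0, 0.0]               # normal vector along the tangent direction
w_orthogonal = [0.0, 1.0]             # normal vector orthogonal to all tangents

tau_parallel = invariant_objective(w_parallel, tangents, lam=0.5)
tau_orthogonal = invariant_objective(w_orthogonal, tangents, lam=0.5)
tau_standard = invariant_objective(w_parallel, tangents, lam=1.0)
```

Both weight vectors have unit norm, yet for $\lambda = 0.5$ the one aligned with the tangents costs 0.875 against 0.25 for the orthogonal one; for $\lambda = 1$ the invariance term vanishes and only $\frac{1}{2}\|w\|^2$ remains.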

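The above sum can be rewritten as

$$\begin{aligned}
\frac{1}{m} \sum_{i=1}^{m} \left\langle w, \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i \right\rangle^2
&= \frac{1}{m} \sum_{i=1}^{m} \left\langle w, \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i \right\rangle \left\langle \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i, w \right\rangle \\
&= \langle w, T w \rangle,
\end{aligned} \tag{11.15}$$

where the matrix $T$ is defined as in (11.7),

$$T := \frac{1}{m} \sum_{i=1}^{m} \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i \right) \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_i \right)^\top \tag{11.16}$$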

(if we want to use more than one derivative operator, we also sum over these; 
normalization may then be required). To solve the optimization problem, we 
introduce a Lagrangian,

$$L(w, b, \alpha) = \frac{1}{2} \left( (1 - \lambda) \langle w, T w \rangle + \lambda \|w\|^2 \right) - \sum_{i=1}^{m} \alpha_i \left( y_i (\langle x_i, w \rangle + b) - 1 \right), \tag{11.17}$$

with Lagrange multipliers $\alpha_i$. At the solution, the gradient of $L$ with respect to $w$
must vanish,

$$(1 - \lambda) T w + \lambda w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0. \tag{11.18}$$

As the left hand side of (11.15) is non-negative for any $w$, $T$ is a positive definite
(though not necessarily strictly definite) matrix. It follows that for

$$T_\lambda := (1 - \lambda) T + \lambda \mathbf{1} \tag{11.19}$$

to be invertible, $\lambda > 0$ is a sufficient condition. In this case, we get the following
expansion for the solution vector:

$$w = \sum_{i=1}^{m} \alpha_i y_i T_\lambda^{-1} x_i. \tag{11.20}$$

Together with (7.3), (11.20) yields the decision function

$$f(x) = \mathrm{sgn} \left( \sum_{i=1}^{m} \alpha_i y_i \langle x, T_\lambda^{-1} x_i \rangle + b \right). \tag{11.21}$$
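The decision function (11.21) can be evaluated without forming $T_\lambda^{-1}$ explicitly, by solving one linear system per Support Vector. The sketch below uses a small Gaussian elimination routine; the matrix, the Support Vectors, and the coefficients are made-up values, and the convention of mapping $g = 0$ to $+1$ is our own choice.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def invariant_decision(x, svs, alphas, labels, b, T_lam):
    """f(x) = sgn(sum_i alpha_i y_i <x, T_lam^{-1} x_i> + b), cf. (11.21)."""
    g = b
    for sv, alpha, y in zip(svs, alphas, labels):
        z = solve(T_lam, sv)                     # z = T_lam^{-1} x_i
        g += alpha * y * sum(p * q for p, q in zip(x, z))
    return 1 if g >= 0 else -1


T_lam = [[1.6, 0.2], [0.2, 0.7]]                 # hypothetical T_lambda
svs = [[1.0, 1.0], [-1.0, -1.0]]                 # hypothetical Support Vectors
alphas, labels, b = [0.5, 0.5], [+1, -1], 0.0

z0 = solve(T_lam, svs[0])
f_a = invariant_decision([1.0, 0.0], svs, alphas, labels, b, T_lam)
f_b = invariant_decision([2.0, -1.0], svs, alphas, labels, b, T_lam)
```

In practice the vectors $T_\lambda^{-1} x_i$ would be computed once after training rather than at every evaluation.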


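Substituting (11.20) into the Lagrangian (11.17), and given the fact that at the
point of the solution, the partial derivative of $L$ with respect to $b$ must vanish
($\sum_{i=1}^{m} \alpha_i y_i = 0$), we get

$$\begin{aligned}
W(\alpha) = {} & \frac{1}{2} \left\langle \sum_{i=1}^{m} \alpha_i y_i T_\lambda^{-1} x_i, \; T_\lambda \sum_{j=1}^{m} \alpha_j y_j T_\lambda^{-1} x_j \right\rangle \\
& - \left\langle \sum_{i=1}^{m} \alpha_i y_i x_i, \; \sum_{j=1}^{m} \alpha_j y_j T_\lambda^{-1} x_j \right\rangle + \sum_{i=1}^{m} \alpha_i.
\end{aligned} \tag{11.22}$$

By virtue of the fact that $T_\lambda$, and thus also $T_\lambda^{-1}$, is symmetric, the dual form of the
optimization problem takes the form

$$\begin{aligned}
\text{maximize} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, T_\lambda^{-1} x_j \rangle \\
\text{subject to} \quad & \alpha_i \ge 0, \quad i = 1, \ldots, m, \\
\text{and} \quad & \sum_{i=1}^{m} \alpha_i y_i = 0.
\end{aligned} \tag{11.23}$$

We have thus arrived at the result of Section 11.4.1, using a rather different
approach. The same derivation can be carried out for the nonseparable case,
leading to the corresponding result with the soft margin constraints (7.38) and
(7.39).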

We are now in a position to take the step into feature space. As in Section 7.4, 
we now think of the patterns x; as belonging in some dot product space H related 
to input space by a map

$$\Phi : \mathcal{X} \to \mathcal{H} \tag{11.24}$$
$$x_i \mapsto \mathbf{x}_i = \Phi(x_i). \tag{11.25}$$

Here $\mathcal{X}$ could be $\mathbb{R}^N$, or some other space that allows us to compute Lie derivatives.

Unfortunately, (11.21) and (11.23) are not expressed in terms of dot products
between images of input patterns under $\Phi$. Hence, substituting kernel functions for
dot products will not do. In addition, note that $T_\lambda$ now becomes an operator in
a possibly infinite-dimensional space, and is still written $T_\lambda$. In this case, we can no
longer easily compute the derivatives of patterns with respect to the
transformations. We thus resort to finite differences with some small $t > 0$, and define the
tangent covariance matrix $T$ as

$$T := \frac{1}{m t^2} \sum_{j=1}^{m} \left( \Phi(\mathcal{L}_t x_j) - \Phi(x_j) \right) \left( \Phi(\mathcal{L}_t x_j) - \Phi(x_j) \right)^\top. \tag{11.26}$$
For the sake of brevity, we omit the summands corresponding to derivatives in the 
opposite direction, which ensure that the data set is centered. For the final tangent 
covariance matrix T, these do not make a difference, as the two negative signs 
cancel out. 

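We cannot compute $T$ explicitly, but we can nevertheless compute (11.21) and
(11.23). First note that for all $x, x' \in \mathbb{R}^N$,

$$\langle \Phi(x), T_\lambda^{-1} \Phi(x') \rangle = \langle T_\lambda^{-\frac{1}{2}} \Phi(x), T_\lambda^{-\frac{1}{2}} \Phi(x') \rangle, \tag{11.27}$$

with $T_\lambda^{-\frac{1}{2}}$ being the positive definite square root of $T_\lambda^{-1}$. At this point, Kernel PCA
(Section 1.7) comes to our rescue. As $T_\lambda$ is symmetric, we may diagonalize it as

$$T_\lambda = U D U^\top, \tag{11.28}$$

where $U$ is a unitary matrix ($U^\top U = \mathbf{1}$); hence,

$$T_\lambda^{-\frac{1}{2}} = U D^{-\frac{1}{2}} U^\top. \tag{11.29}$$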

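Substituting (11.29) into (11.27), and using the fact that $U$ is unitary, we obtain

$$\langle \Phi(x), T_\lambda^{-1} \Phi(x') \rangle = \langle U D^{-\frac{1}{2}} U^\top \Phi(x), U D^{-\frac{1}{2}} U^\top \Phi(x') \rangle \tag{11.30}$$
$$= \langle D^{-\frac{1}{2}} U^\top \Phi(x), D^{-\frac{1}{2}} U^\top \Phi(x') \rangle. \tag{11.31}$$

This, however, is simply a dot product between Kernel PCA feature vectors:
$U^\top \Phi(x)$ computes projections onto eigenvectors of $T_\lambda$ (i.e., features), and $D^{-\frac{1}{2}}$
rescales them. To get the eigenvectors, we carry out Kernel PCA on $T_\lambda$. Essentially,
we have to go through the analysis of Kernel PCA using $T_\lambda$ instead of the
covariance matrix of the mapped data in $\mathcal{H}$. We expand the eigenvectors as

$$v = \sum_{i=1}^{m} \alpha_i \left( \Phi(\mathcal{L}_t x_i) - \Phi(x_i) \right), \tag{11.32}$$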
and look for solutions of the eigenvalue equation $\mu v = T_\lambda v$, with $\mu > \lambda$ (let us
assume that $\lambda < 1$; for $\lambda = 1$, all eigenvalues are identical to $\lambda$, the minimal
eigenvalue). Note that if $\mathcal{H}$ is infinite-dimensional, then there are infinitely many
eigenvectors with eigenvalue $\mu = \lambda$. We cannot find all of these via Kernel PCA,
hence we impose the restriction $\mu > \lambda$. We shall return to this point below.

These associated eigenvectors lie in the span of the tangent vectors. By analogy
to (1.59), we then obtain

$$m \mu \alpha = \left( (1 - \lambda) K_t + \lambda \mathbf{1} \right) \alpha, \tag{11.33}$$

where $K_t$ is the Gram matrix of the tangent vectors, computed using the same finite
difference approximation for the tangent vectors as in (11.26) (see [467]);

$$(K_t)_{ij} = \langle \Phi(\mathcal{L}_t x_i) - \Phi(x_i), \Phi(\mathcal{L}_t x_j) - \Phi(x_j) \rangle \tag{11.34}$$

(for a different way of expressing $K_t$, see Problem 11.4). To ensure that the
eigenvectors $v^k$ (11.32) are normalized, the corresponding expansion coefficient vectors
$\alpha^k$ have to satisfy


jo (at, a) (11.35) 


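where the $\mu_k$ are those eigenvalues of $(1 - \lambda) K_t + \lambda \mathbf{1}$ that are larger than $\lambda$
(Problem 11.6). Feature extraction is carried out by computing the projection of $\Phi(x)$
onto eigenvectors $v$;

$$\begin{aligned}
\langle v, \Phi(x) \rangle &= \sum_{k=1}^{m} \alpha_k \langle \Phi(\mathcal{L}_t x_k) - \Phi(x_k), \Phi(x) \rangle \\
&= \sum_{k=1}^{m} \alpha_k \left( k(\mathcal{L}_t x_k, x) - k(x_k, x) \right).
\end{aligned} \tag{11.36}$$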
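Both the tangent Gram matrix (11.34) and the projections (11.36) need only kernel evaluations. The sketch below expands each entry of $K_t$ into four kernel values by bilinearity of the dot product; the linear kernel and the shift along the first coordinate are stand-ins chosen so the result is easy to check by hand, and all function names are hypothetical.

```python
def tangent_gram(patterns, transform, kernel, t):
    """(K_t)_ij = <Phi(L_t x_i) - Phi(x_i), Phi(L_t x_j) - Phi(x_j)>, cf. (11.34),
    expanded into four kernel evaluations."""
    moved = [transform(x, t) for x in patterns]
    m = len(patterns)
    return [[kernel(moved[i], moved[j]) - kernel(moved[i], patterns[j])
             - kernel(patterns[i], moved[j]) + kernel(patterns[i], patterns[j])
             for j in range(m)] for i in range(m)]


def project(x, alpha, patterns, transform, kernel, t):
    """<v, Phi(x)> = sum_k alpha_k (k(L_t x_k, x) - k(x_k, x)), cf. (11.36)."""
    return sum(a * (kernel(transform(p, t), x) - kernel(p, x))
               for a, p in zip(alpha, patterns))


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def shift_first(x, t):
    """Hypothetical transformation L_t: translate the first coordinate by t."""
    return [x[0] + t] + list(x[1:])


patterns = [[1.0, 2.0], [3.0, -1.0]]
K_t = tangent_gram(patterns, shift_first, linear_kernel, t=0.1)
p = project([2.0, 3.0], [1.0, 0.0], patterns, shift_first, linear_kernel, t=0.1)
```

With this choice every finite-difference tangent vector equals (0.1, 0), so all entries of `K_t` are 0.01 and the projection of (2, 3) is 0.2; any other kernel function can be passed in unchanged.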
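What happens to all the eigenvectors of $T_\lambda$ with eigenvalue $\lambda$? To take these into
account, we decompose $T_\lambda^{-1}$ into a part $\frac{1}{\lambda} \mathbf{1} - T_\lambda^{-1}$ that has a rank of at most $m$, and
another part $\frac{1}{\lambda} \mathbf{1}$ that is a multiple of the identity.⁹ We thus find that the invariant
kernel (11.27) can be written as

$$\langle \Phi(x), T_\lambda^{-1} \Phi(x') \rangle = \frac{1}{\lambda} \langle \Phi(x), \Phi(x') \rangle - \left\langle \Phi(x), \left( \frac{1}{\lambda} \mathbf{1} - T_\lambda^{-1} \right) \Phi(x') \right\rangle. \tag{11.37}$$

For the first term, we can immediately substitute the original kernel to get $\frac{1}{\lambda} k(x, x')$.
For the second term, we employ the eigenvector decomposition (11.28), $T_\lambda = U D U^\top$.
The diagonal elements of $D$ are the eigenvalues of $T_\lambda$, and satisfy $D_{ii} \ge \lambda$
for all $i$. Using the same eigenvectors, we can decompose $\frac{1}{\lambda} \mathbf{1} - T_\lambda^{-1}$ as

$$\frac{1}{\lambda} \mathbf{1} - T_\lambda^{-1} = U \left( \frac{1}{\lambda} \mathbf{1} - D^{-1} \right) U^\top. \tag{11.38}$$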


The rank of this operator being at most m, it suffices to compute the columns of 


9. Thanks to Olivier Chapelle for useful suggestions. 
$U$ which correspond to the part of the space where the operator is nonzero. These
columns are the normalized eigenvectors $v^1, \ldots, v^q$ ($q \le m$) of $T_\lambda$ that have
eigenvalues $\mu > \lambda$, as described above. Let us denote the corresponding eigenvalues by
$\mu_1, \ldots, \mu_q$.

We can thus evaluate the invariant kernel (11.37) as

$$\langle \Phi(x), T_\lambda^{-1} \Phi(x') \rangle = \frac{1}{\lambda} k(x, x') - \sum_{i=1}^{q} \langle v^i, \Phi(x) \rangle \left( \frac{1}{\lambda} - \frac{1}{\mu_i} \right) \langle v^i, \Phi(x') \rangle, \tag{11.39}$$

where the dot product terms are computed using (11.36).
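To see (11.39) at work without an eigensolver, consider the special case of a single tangent vector ($m = q = 1$). We assume here that the tangent covariance operator is $T = d d^\top / t^2$ with $d = \Phi(\mathcal{L}_t x_1) - \Phi(x_1)$, so that the only eigenvector with $\mu > \lambda$ is $v = d / \|d\|$, with $\mu = (1 - \lambda) \|d\|^2 / t^2 + \lambda$. The function name and the toy transformation are our own; the general case requires solving the eigenvalue problem for $K_t$.

```python
import math


def rank_one_invariant_kernel(x, y, x1, transform, kernel, t, lam):
    """Invariant kernel (11.39) for one tangent vector, using kernel values only."""
    moved = transform(x1, t)
    # Squared norm of d = Phi(L_t x1) - Phi(x1), i.e. the 1x1 Gram matrix (11.34).
    d_norm_sq = kernel(moved, moved) - 2.0 * kernel(moved, x1) + kernel(x1, x1)
    # Eigenvalue of T_lambda along d, under the assumption T = d d^T / t^2.
    mu = (1.0 - lam) * d_norm_sq / t ** 2 + lam

    def proj(z):
        # <v, Phi(z)> for the normalized eigenvector v = d / ||d||, cf. (11.36).
        return (kernel(moved, z) - kernel(x1, z)) / math.sqrt(d_norm_sq)

    return kernel(x, y) / lam - proj(x) * (1.0 / lam - 1.0 / mu) * proj(y)


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def shift_first(x, t):
    """Hypothetical transformation L_t: translate the first coordinate by t."""
    return [x[0] + t] + list(x[1:])


k_inv = rank_one_invariant_kernel([1.0, 2.0], [3.0, 4.0], [0.5, -1.0],
                                  shift_first, linear_kernel, t=0.1, lam=0.5)
```

For the linear kernel this can be checked directly: here $T_\lambda = \mathrm{diag}(1, 0.5)$, so $\langle x, T_\lambda^{-1} x' \rangle = 1 \cdot 3 + 2 \cdot (2 \cdot 4) = 19$, which is the value the kernel-only formula returns. The component along the tangent direction is damped relative to the orthogonal one.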

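As in the linear case, we can interpret the second part of this kernel as a
preprocessing operation. Since $\frac{1}{\lambda} - \frac{1}{\mu_i} \ge 0$ for all $i = 1, \ldots, q$, we can take square roots,
ending up with the preprocessing operation

$$\begin{pmatrix} \sqrt{\lambda^{-1} - \mu_1^{-1}} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sqrt{\lambda^{-1} - \mu_q^{-1}} \end{pmatrix} U^\top. \tag{11.40}$$

As above, $U^\top$ computes $q$ eigenvector projections of the form (11.36).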

We conclude the theoretical part of this section by noting an essential difference 
between the approach described and that of Section 11.4.1, which we believe is an 
advantage of the present method: in Section 11.4.1, the pattern preprocessing is 
assumed to be linear. In the present method, the goal to get invariant hyperplanes 
in feature space leads to a nonlinear preprocessing operation. 


11.4.3 Experiments 


Let us now consider some experimental results, for the linear case. The nonlinear 
case has only recently been explored experimentally on real-world data, with 
promising initial results [100]. We used the small MNIST database described in 
Section A.1. We start by giving some baseline classification results. 

Using a standard linear SVM (that is, a separating hyperplane, Section 7.5), we 
observe a test error rate of 9.8%; by using a polynomial kernel of degree 4, this 
drops to 4.0%. In all of the following experiments, we used degree 4 kernels of 
various types. In a series of reference experiments with a homogeneous
polynomial kernel $k(x, x') = \langle x, x' \rangle^4$, using preprocessing with Gaussian smoothing
kernels of standard deviation in the range $0.1, 0.2, \ldots, 1.0$, we obtained error rates
which gradually increase from 4.0% to 4.3%. We conclude that no improvement
of the original 4.0% performance is possible by a simple smoothing operation. 
Therefore, if our linear preprocessing ends up doing better, it is not due to a sim- 
ple smoothing effect. 

Table 11.4 reports results obtained by preprocessing all patterns with $B$ (cf.
(11.10)), with various values of $\lambda$ (cf. (11.11)). In the experiments, the patterns
were scaled to have entries in [0, 1], then $B$ was computed, using horizontal and
Table 11.4 Classification error rates, when the kernel $k(x, x') = \langle x, x' \rangle^4$ is modified with
the invariant hyperplane preprocessing matrix $B_\lambda = T_\lambda^{-\frac{1}{2}}$; cf. Eqs. (11.10)-(11.11). Enforcing
invariance with $\lambda = 0.2, 0.3, \ldots, 0.9$ leads to improvements over the original performance
($\lambda = 1$).

$\lambda$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0

Figure 11.5 The first pattern in the small MNIST database, preprocessed with $B_\lambda = T_\lambda^{-\frac{1}{2}}$
(cf. equations (11.10)-(11.11)), with various amounts of invariance enforced. Top row: $\lambda$ =
0.1, 0.2, 0.3, 0.4; bottom row: $\lambda$ = 0.5, 0.6, 0.7, 0.8. For some values of $\lambda$, the preprocessing
resembles a smoothing operation; preprocessing leads to somewhat higher classification
accuracies (see text) than the latter, however.


vertical translations (this was our choice of $\mathcal{L}_t$), and preprocessing was carried
out; finally, the resulting patterns were scaled back again (for snapshots of the
resulting patterns, see Figure 11.5). The scaling was done to ensure that patterns
and derivatives lay in comparable regions of $\mathbb{R}^N$: the most common value of
the derivative is 0, corresponding to the constant pattern background; this value 
should thus also be the most common value in the original patterns. The results 
show that even though (11.7) is derived for the linear case, it leads to slight 
improvements in the nonlinear case (for a degree 4 polynomial). 

The above [0, 1] scaling operation is affine rather than linear, hence the argument 
leading to (11.13) does not hold for this case. We thus only report results on 
dimensionality reduction for the case where the data are kept in [0,1] scaling 
during the whole procedure. 

We used a translation invariant radial basis function kernel (7.27) with c = 0.5. 
On the [−1, 1] data, for $\lambda$ = 0.4, this leads to the same performance as the degree 
4 polynomial; that is, 3.6% (without invariance preprocessing, meaning that for 
$\lambda$ = 1, the performance is 3.9%). To get the identical system on [0,1] data, the 
RBF width was rescaled accordingly, to c = 0.125. Table 11.5 shows that discarding 
principal components can further improve performance, up to 3.3%. 

Table 11.5 Results obtained through dropping directions corresponding to small 
eigenvalues of C, or in other words, dropping less important principal components (cf. (11.13)), 
for the translation invariant RBF kernel (see text). All results given are for the case $\lambda$ = 0.4 
(cf. Table 11.4). 

Jittered Kernel 


We now follow [134] in describing a method which, like the virtual SV method, 
applies transformations to patterns; this time, however, the transformations are 
part of the kernel definition. This idea is called kernel jittering [134]; it is closely 
related to a concept called tangent distance [496]. Loosely speaking, kernel jittering 
consists of moving around the inputs of the kernel (or, in [496], of a two-norm 
distance), using transformations such as translations, until the match is best. 

For any admissible SVM Gram matrix $K_{ij} = k(x_i, x_j)$, consider a jittering kernel 
form $K^J_{ij} = k^J(x_i, x_j)$, defined procedurally as follows: 

1. Consider all jittered forms of example $x_i$ (including itself) in turn, and select 
the one ($x_q$) “closest” to $x_j$; specifically, select $x_q$ to minimize the metric distance 
between $x_q$ and $x_j$ in the space induced by the kernel. This distance is given by 
(cf. footnote 10) 

$$\sqrt{K_{qq} - 2K_{qj} + K_{jj}}. \tag{11.41}$$

2. Let $K^J_{ij} = K_{qj}$. 

For some kernels, such as radial-basis functions (RBFs), simply selecting the 
maximum $K_{qj}$ value to be the value for $K^J_{ij}$ suffices, since the $K_{qq}$ and $K_{jj}$ terms are 
constant in this case. This similarly holds for translation jitters, as long as sufficient 
padding exists so that no image pixels fall off the image after translations. In 
general, a jittering kernel may have to consider jittering one or both examples. 
For symmetric invariances such as translation, however, it suffices to jitter just one 
example. 
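The two-step procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [134]: the kernel is linear, the jitter set consists of cyclic shifts by at most one pixel, and the toy patterns are zero-padded so that no pixel wraps around. The names `jitter_kernel`, `translations` and `linear_kernel` are hypothetical.

```python
import numpy as np

def linear_kernel(a, b):
    return float(np.dot(a, b))

def translations(x, max_shift=1):
    """Jittered forms of x: shifts by -max_shift..max_shift (shift 0 is x itself)."""
    return [np.roll(x, s) for s in range(-max_shift, max_shift + 1)]

def jitter_kernel(x_i, x_j, kernel=linear_kernel, jitter=translations):
    """Return (K^J_ij, squared kernel-induced distance of the selected jitter x_q)."""
    best_d2, best_k = None, None
    k_jj = kernel(x_j, x_j)
    for x_q in jitter(x_i):
        k_qj = kernel(x_q, x_j)
        d2 = kernel(x_q, x_q) - 2.0 * k_qj + k_jj   # squared distance in feature space
        if best_d2 is None or d2 < best_d2:
            best_d2, best_k = d2, k_qj              # step 1: keep the closest jitter
    return best_k, best_d2                          # step 2: K^J_ij = K_qj

# A single bright pixel on a padded row, so that a 1-pixel shift never wraps around.
A = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
B = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
k_ab, d2_ab = jitter_kernel(A, B)
```

Here the plain linear kernel gives k(A, B) = 0, whereas the jittered value is 1, because one of the shifted copies of A coincides with B.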

10. This corresponds to Euclidean distance in the feature space defined by the kernel, using 
the definition of the two-norm, 

$$\|x_i - x_j\|^2 = \langle x_i, x_i \rangle - 2\langle x_i, x_j \rangle + \langle x_j, x_j \rangle.$$

The use of jittering kernels is referred to as the JSV approach. A major motivation 
for considering this approach is that VSV approaches scale at least quadratically 
in the number (J) of jitters considered. This is because SVM training scales 
at least quadratically in the number of training examples, and VSV essentially 
expands the training set by a factor of J. Jittering kernels are J times more expensive 
to compute, since each $K^J_{ij}$ computation involves finding a minimum over J 
computations of $K_{ij}$. The potential benefit, however, is that the training set is J times 
smaller than in methods such as VSV. Thus, the potential net gain is that JSV training 
may only scale linearly in J, instead of quadratically as in VSV. Furthermore, 
through comprehensive use of kernel caching, as is common in modern practical 
SVM implementations, even that factor of J may be largely amortized away. 
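As a rough illustration of this scaling argument, assume (purely for the sake of the estimate) that training cost is proportional to the number of kernel evaluations and grows exactly quadratically in the training set size m. The value J = 9 below is a hypothetical example, corresponding to all translations by at most one pixel in each direction.

```latex
\text{VSV:}\quad (Jm)^2 \cdot 1 = J^2 m^2,
\qquad
\text{JSV:}\quad m^2 \cdot J = J m^2,
\qquad
\text{e.g. } J = 9:\quad 81\,m^2 \ \text{versus}\ 9\,m^2 .
```

The first factor counts pairs of training examples and the second the cost per kernel value, so the ratio between the two approaches is J under these assumptions.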

As mentioned above, the kernel values induce a set of distances between the 
points (cf. footnote 10). For positive definite kernels, we know that the feature 
space has the structure of a dot product space, thus we obtain a valid metric in 
that space (see Section B.2.2). For jittering kernels, this is not necessarily the case; 
in particular, the triangle inequality can be violated. 

For example, imagine three simple images A, B, and C consisting of a single 
row of three binary pixels, with A = (1,0,0), B = (0,1,0), and C = (0,0,1). The minimal 
jittered distances (under 1-pixel translation) between A and B and between B and 
C are 0. However, the distance between A and C is positive (e.g., the squared 
distance is 1 − 2·0 + 1 = 2 for a linear kernel). Thus, the triangle inequality 
requirement of d(A, B) + d(B, C) ≥ d(A, C) is violated in this example (cf. also [496]). 
Note that with a sufficiently large jittering set (such as the one including both 
1-pixel and 2-pixel translations in the above example), the triangle inequality is 
not violated. 
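The example can be checked numerically. The sketch below assumes a linear kernel, for which the kernel-induced distance is the ordinary Euclidean distance, and embeds the three patterns in a zero-padded row so that cyclic shifts behave like translations with sufficient padding; `jittered_distance` is a hypothetical helper name.

```python
import numpy as np

def jittered_distance(x_i, x_j, shifts):
    """Smallest feature-space distance (linear kernel) over the jittered forms of x_i."""
    d2 = [np.sum((np.roll(x_i, s) - x_j) ** 2) for s in shifts]
    return float(np.sqrt(min(d2)))

# A, B, C from the text, zero-padded so that shifted pixels never wrap around.
A = np.array([0, 0, 1, 0, 0, 0, 0])
B = np.array([0, 0, 0, 1, 0, 0, 0])
C = np.array([0, 0, 0, 0, 1, 0, 0])

one_pixel = [-1, 0, 1]
two_pixel = [-2, -1, 0, 1, 2]

d_ab = jittered_distance(A, B, one_pixel)
d_bc = jittered_distance(B, C, one_pixel)
d_ac = jittered_distance(A, C, one_pixel)
violated = d_ab + d_bc < d_ac                   # triangle inequality fails

d_ac_large = jittered_distance(A, C, two_pixel) # larger jitter set repairs it
```

With the one-pixel jitter set, d(A, B) and d(B, C) vanish while d(A, C) does not; once two-pixel shifts are included, d(A, C) vanishes as well.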

In practice, violations tend to be rare, and are unlikely to present difficulties in 
training convergence (the SMO algorithm usually handles such cases; see [134]). 
Based on experiments with kernel jittering to date, it is still unclear how much im- 
pact any such violations typically have on generalization performance in practice. 

Jittering kernels have one other potential disadvantage compared with VSV 
approaches: the kernels must continue to jitter at test time. In contrast, the VSV 
approach effectively compiles the relevant jittering into the final set of SVs it 
produces. Nonetheless, in cases where the final JSV SV set size is much smaller 
than the final VSV SV set size, the JSV approach can still be faster at test time. 

In [134], some experiments are reported which compare VSV and JSV methods 
on a small subset of the MNIST training set. These experiments illustrate typical 
relative behaviors, such as the JSV test time often being almost as fast or faster than 
VSV (even though JSV must jitter at test time), due to JSV having many fewer final 
SVs. Furthermore, both JSV and VSV typically beat standard SVMs (which do not 
incorporate invariance). They also both typically beat query jitter as well, in which 
the test examples are jittered inside the kernels during SVM output computations. 
Query jitter effectively uses jittering kernels at test time, even though the SVM 
is not specifically trained for a jittering kernel. This case was tested simply as 
a control in such experiments, to verify that training the SVM with the actual 
jittering kernel used at test time is indeed important. 

While relative test errors between VSV and JSV vary, it does seem that VSV 
methods are substantially more robust than JSV methods, in terms of the 
generalization error variance. 

11.6 Summary 


Invariances can readily be incorporated in Support Vector learning machines, by 
generating virtual examples from the Support Vectors, rather than from the whole 
training set. The method yields a significant gain in classification accuracy for 
a moderate cost in time: it requires two training runs (rather than one), and it 
constructs classification rules utilizing more Support Vectors, thus slowing down 
classification speed (cf. (7.25)). Given that Support Vector Machines are known to 
exhibit fairly short training times (as indicated by the benchmark comparison of 
[64], cf. also Chapter 10), the first point is usually not critical. Certainly, training 
on virtual examples generated from the whole database would be significantly 
slower. To compensate for the second point, we can use reduced set methods 
(Chapter 18). 

As an alternative approach, we can build known invariances directly into the 
SVM objective function via the choice of kernel. With its rather general class of 
admissible kernel functions, the SV algorithm provides ample possibilities for 
constructing task-specific kernels. The method described for constructing kernels 
for transformation invariant SVMs (invariant hyperplanes) has so far only been 
applied to real world problems in the linear case (for encouraging toy results 
on nonlinear data, cf. [100]), which probably explains why it is only seen to 
lead to moderate improvements, especially when compared with the large gains 
achieved by the Virtual SV method. The transformation invariant kernel method 
is applicable to differentiable transformations — other types, such as those for 
mirror symmetry, have to be dealt with using other techniques, such as the Virtual 
Support Vector method or jittered kernels. Its main advantages compared with 
the latter techniques are that in the linear case, it does not slow down testing 
speed, and that using more invariances leaves training time almost unchanged. In 
addition, it is rather attractive from a theoretical point of view, as it establishes a 
surprising connection to invariant feature extraction, preprocessing, and principal 
component analysis. 

Although partly heuristic, the techniques in this chapter have led to the record 
result on the MNIST database. In addition, SVMs present clear opportunities 
for further improvement. More invariances (in the pattern recognition case, for 
instance, small rotations, or varying ink thickness) could be incorporated. Further, 
we might use only those Virtual Support Vectors which provide new information 
about the decision boundary, or use a measure of such information to keep only 
the most important vectors. Finally, if locality-improved kernels, to be described 
in Section 13.3, prove to be as useful on the full MNIST database as they are on 
subsets of it, accuracies could be substantially increased — admittedly at a cost in 
classification speed. 

We conclude this chapter by noting that all three techniques described should be 
directly applicable to other kernel-based methods, such as SV regression [561] and 
Kernel PCA (Chapter 14). Note, finally, that this chapter only covers some aspects 
of incorporating prior knowledge, namely about invariances. Further facets are 
treated elsewhere, such as semiparametric modelling (Section 4.8), and methods 
for dealing with heteroscedastic noise (Section 9.5). 


11.1 (VSVs in Regression •) Apply the VSV method to a regression problem. 


11.2 (Examples of Pattern Invariance Transformations •) Try to think of other 
pattern transformations that leave class membership invariant; for instance, in the case of 
image processing: translations, brightness transformations, contrast transformations, ... 


11.3 (Task-Dependent Tangent Covariance Matrix ••) Experiment with tangent 
covariance matrices computed from the SVs only. Compare with the method of [364]. 


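11.4 (Alternative Expression of the Tangent Vector Gram Matrix [100] ••) Prove 
that under suitable differentiability conditions, the elements (11.34) of the tangent vector 
Gram matrix can be written as a quadratic form in terms of the kernel Hessian, 

$$(K_t)_{ij} = \lim_{t \to 0} \frac{1}{t^2} (\mathcal{L}_t x_i - x_i)^\top \frac{\partial^2 k(x_i, x_j)}{\partial x_i \, \partial x_j} (\mathcal{L}_t x_j - x_j). \tag{11.42}$$

Hint: note that the Hessian can be written as 

$$\frac{\partial^2 k(x_i, x_j)}{\partial x_i \, \partial x_j} = \left( \frac{\partial}{\partial x_i} \Phi(x_i) \right)^\top \left( \frac{\partial}{\partial x_j} \Phi(x_j) \right). \tag{11.43}$$
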
11.5 (Regularizing the Tangent Covariance Matrix ••) Discuss the difference 
between adding $\lambda \mathbf{1}$ to the tangent covariance matrix in the linear case, and in the nonlinear 
case (Section 11.4). Note that in the nonlinear case, $\lambda \mathbf{1}$ only has an effect in the span of the 
training patterns. 
Rather than using the decomposition (11.37), try to deal with this problem by including 
other patterns in the expansion. Perform experiments to test your approach. 

11.6 (Eigenvectors of the Regularized Tangent Covariance Matrix •) Suppose $v^k$ 
is an eigenvector of $(1 - \lambda) K_t + \lambda \mathbf{1}$ with eigenvalue $\mu_k > \lambda$. Prove that to ensure 
$\langle v^k, v^k \rangle = 1$ in $\mathcal{H}$, the coefficient vector $\alpha^k$ has to satisfy (11.35). 


11.7 (Nonlinear Invariant Kernels ••) Implement the kernel (11.39) for a visual 
pattern recognition problem, using small image translations to compute the approximate 
tangent vectors (cf. (11.26)). Apply it and study the effect of $\lambda$ and of the size of the image 
translations. 


11.8 (Scaling of Input Dimensions •••) Consider kernels of the form $k_D(x, x') = 
k(Dx, Dx')$, where D is a diagonal scaling matrix. How can you use prior knowledge 
to choose D? Can you devise a method which will drive a large fraction of the 
diagonal entries of D to 0 (“input feature selection”)? Discuss the relationship to Automatic 
Relevance Determination and Relevance Vector Machines (Chapter 16). 

Overview 

Chapter 5 mainly dealt with the fundamental problem of uniform convergence 
and under which conditions it occurs, based on a VC perspective. The present 
chapter takes a slightly more practical approach by proving that certain estimation 
methods such as the minimization of a regularized risk functional or the leave- 
one-out error are, indeed, reliable estimators of the expected risk. Furthermore, 
it shows alternative means of determining the reliability of estimators, based on 
entropy numbers of compact operators associated with function classes, and the 
Kullback-Leibler divergence between prior and posterior distributions. 

The chapter is organized as follows. In Section 12.1 we will introduce the notion 
of a concentrated random variable and state a concentration of measure inequality 
of McDiarmid. This notion will become useful to prove that for minimizers of the 
regularized risk functional the empirical risk is a not too unreliable estimator of the 
expected loss. A similar fact holds also for the leave-one-out estimator, discussed 
in Section 12.2. This estimator also enjoys other good properties, such as being 
unbiased; its computation, however, is rather expensive. For this reason, we give 
three methods of approximating this estimator, ranging from an O(d) to an O(d*) 
method, where d is the number of nonzero expansion coefficients. 

Further ways of assessing the generalization performance of an estimator are 
explained in Sections 12.3 and 12.4. The first of these methods is based on the 
concept of a distribution over classifiers much akin to a Bayesian estimator. The 
second, in turn, takes an operator theoretic approach to measure the capacity of 
the function class described by a kernel estimator explicitly. In short, it derives a 
more fine grained measure of capacity than the VC dimension. 

It is useful if the reader has some familiarity with the contents of Chapter 5; 
however, it is not essential for an understanding of Sections 12.1, 12.2, and 12.3. Only 
Section 12.4 assumes familiarity with the techniques underlying the derivation of 
a VC bound, as described in Section 5.5. 

Knowledge of the theory of Reproducing Kernel Hilbert Spaces (Section 2.2.2) is 
required for some of the proofs of Sections 12.1 and 12.2. A good working 
knowledge of matrix algebra is useful in Section 12.2, and before reading Section 12.3 we 
recommend that the reader become familiar with the basics of Bayesian inference, 
as described in Section 16.1. Finally, Section 12.4 will most likely be difficult for 
readers not familiar with notions of functional analysis. 

[Figure: diagram of section dependencies, with nodes 12.1 Concentration and Stability; 
12.2 Leave-One-Out Estimator (12.2.3 Span Bound, 12.2.5 Statistical Mechanics); 
12.3 PAC-Bayesian; 12.4 Entropy and Covering Numbers; 16 Bayesian Inference; 
5.5 Building a VC-Style Bound.] 


12.1 Concentration of Measure Inequalities 



Most of the reasoning presented in this book is concerned with minimizing a 
regularized risk functional $R_{\mathrm{reg}}[f]$. This generates a very specific type of statistical 
estimator and it may be expected that we could find some uniform convergence 
bounds that take advantage of the fact that it is not an arbitrary estimator we are 
dealing with. In what follows we use a slight extension of an idea by Bousquet and 
Elisseeff [67], originally described in [160], to provide such a bound. It is based on 
the concentration of measure inequalities. 

These are a family of theorems which state that certain random variables $\xi$ 
are concentrated, which means that, with high probability, the values of random 
draws of $\xi$ are very close to their expected values $\mathbf{E}[\xi]$. In this case we say that the 
distribution is concentrated around its mean. We have already encountered one of 
these cases in Theorem 5.1, where we saw that the average over a set of bounded 
random variables is, with high probability, close to the expected average. 

The concept of concentration is, however, much more general than just that. It 
deals with classes of functions $g(\xi_1, \ldots, \xi_m)$ of random variables $(\xi_1, \ldots, \xi_m)$ which 
have the property of being concentrated. For instance, if the influence of each $\xi_i$ 
on g is limited, g is not likely to vary much. What we show in the following is that 
the minimizers of the regularized risk functional enjoy this property. 


12.1.1 McDiarmid’s Bound 

Roughly speaking, McDiarmid’s bound states that if arbitrary replacements of 
random variables $\xi_i$ do not affect the value of g excessively, then the random 
variable $g(\xi_1, \ldots, \xi_m)$ is concentrated. 


Theorem 12.1 (McDiarmid [356]) Denote by $\xi_1, \ldots, \xi_m$ iid random variables and 
assume that there exists a function $g : \xi^m \to \mathbb{R}$ with the property that for all $i \in [m]$ (we use 
the shorthand $[m] := \{1, \ldots, m\}$), and $c_i > 0$, 

$$\sup_{\xi_1, \ldots, \xi_m, \xi_i'} \left| g(\xi_1, \ldots, \xi_m) - g(\xi_1, \ldots, \xi_{i-1}, \xi_i', \xi_{i+1}, \ldots, \xi_m) \right| \le c_i, \tag{12.1}$$

where $\xi_i'$ is drawn from the same distribution as $\xi_i$. Then 

$$P\left\{ \left| g(\xi_1, \ldots, \xi_m) - \mathbf{E}\left[ g(\xi_1, \ldots, \xi_m) \right] \right| > \epsilon \right\} \le 2 \exp\left( - \frac{2\epsilon^2}{\sum_{i=1}^m c_i^2} \right). \tag{12.2}$$


This means that a bound similar to the law of large numbers can be applied to any 
function g which does not overly depend on individual samples $\xi_i$. Returning to 
the example from the introduction, if we define 

$$g(\xi_1, \ldots, \xi_m) := \frac{1}{m} \sum_{i=1}^m \xi_i, \tag{12.3}$$

where $\xi_i \in [a, b]$, then, clearly, $c_i = \frac{1}{m}(b - a)$ since g can only change by this amount 
if one particular sample is replaced by another. This means that the rhs of (12.2) 
becomes $2 \exp\left( - \frac{2m\epsilon^2}{(b-a)^2} \right)$; in other words, we recover Hoeffding’s bound (Theorem 
5.1) as a special case. See also [137] for details. 
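The special case can be checked numerically. The sketch below reads (12.2) as $2\exp(-2\epsilon^2 / \sum_i c_i^2)$, plugs in $c_i = (b-a)/m$, and compares the result with a simulated tail frequency for sample means of uniform random variables. The sample size, the value of $\epsilon$ and the function name `mcdiarmid_bound` are illustrative assumptions.

```python
import numpy as np

def mcdiarmid_bound(eps, c):
    """Right-hand side of (12.2) for bounded-difference constants c_1, ..., c_m."""
    c = np.asarray(c, dtype=float)
    return 2.0 * np.exp(-2.0 * eps**2 / np.sum(c**2))

m, a, b, eps = 100, 0.0, 1.0, 0.1
bound = mcdiarmid_bound(eps, np.full(m, (b - a) / m))       # c_i = (b - a) / m
hoeffding = 2.0 * np.exp(-2.0 * m * eps**2 / (b - a) ** 2)  # closed form from the text

# Empirical frequency of |mean - expectation| > eps over repeated draws.
rng = np.random.default_rng(0)
means = rng.uniform(a, b, size=(20000, m)).mean(axis=1)
empirical = np.mean(np.abs(means - (a + b) / 2) > eps)
```

Since $\sum_i c_i^2 = (b-a)^2/m$, the two expressions for the bound coincide, and the simulated frequency stays well below it, as one expects from a worst-case inequality.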


12.1.2 Uniform Stability and Convergence 


In order to apply these bounds to learning algorithms we must introduce the 
notion of uniform stability. This is to determine the amount by which an estimate 
$f : \mathcal{X} \to \mathcal{Y}$ based on the training data $Z := \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \mathcal{X} \times \mathcal{Y}$ changes 
if we change one of the training patterns. 


Definition 12.2 (Uniform Stability) Denote a training sample of size m by Z. 
Moreover, denote by $Z^i := (Z \setminus \{z_i\}) \cup \{z\}$ (where $z := (x, y)$) the training sample where the ith 
observation is replaced by z. Finally, denote by $f_Z$ the estimate produced by our learning 
algorithm of choice (and likewise by $f_{Z^i}$ the estimate based on $Z^i$). We call this mapping 
$Z \mapsto f_Z$ uniformly $\beta$-stable with respect to a loss function c if 

$$\left| c(x, y, f_Z(x)) - c(x, y, f_{Z^i}(x)) \right| \le \beta \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y}, \text{ all } Z, \text{ and all } i. \tag{12.4}$$

This means that the loss due to the estimates generated from Z, where an arbitrary pattern 
of the sample has been replaced, will not differ anywhere by more than $\beta$. 


As we shall see, the notion of uniform stability is satisfied for regularization 
networks of different types, provided that the loss function c is Lipschitz continuous 
(that is, provided that c has bounded slope). The following theorem uses Theorem 12.1 to 
prove that $\beta$-stable algorithms exhibit uniform convergence of the empirical risk 
$R_{\mathrm{emp}}[f]$ to the expected risk $R[f]$. 

Theorem 12.3 (Bousquet and Elisseeff [67]) Assume that we have a $\beta$-stable 
algorithm with the additional requirement that $f_Z(x) \le M$ for all $x \in \mathcal{X}$ and for all training 
samples $Z \subset \mathcal{X} \times \mathcal{Y}$. Then, for $m \ge \frac{8M^2}{\epsilon^2}$, we have 

$$P\left\{ \left| R_{\mathrm{emp}}[f_Z] - R[f_Z] \right| > \epsilon \right\} \le \frac{64 M m \beta + 8 M^2}{m \epsilon^2}, \tag{12.5}$$

and for any $m \ge 1$, 

$$P\left\{ \left| R_{\mathrm{emp}}[f_Z] - R[f_Z] \right| > \epsilon + \beta \right\} \le 2 \exp\left( - \frac{m \epsilon^2}{2 (m\beta + M)^2} \right). \tag{12.6}$$


This means that if $\beta$ decreases with increasing m, or, in particular, if $\beta = O(m^{-1})$, 
then we obtain bounds that are optimal in their rate of convergence, specifically, 
bounds which have the same convergence rate as Hoeffding’s bound (5.7). 
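To make the role of the decay rate explicit, here is a short worked step. It reads the exponent of (12.6) as $-m\epsilon^2 / (2(m\beta + M)^2)$ and introduces a constant $b > 0$ (a symbol used only for this illustration).

```latex
\beta = \frac{b}{m}
\;\Longrightarrow\;
\frac{m\epsilon^2}{2(m\beta + M)^2} = \frac{m\,\epsilon^2}{2(b + M)^2}
\quad\text{(linear in } m\text{, as in Hoeffding's bound)},
\qquad
\beta \equiv \beta_0 > 0
\;\Longrightarrow\;
\frac{m\epsilon^2}{2(m\beta_0 + M)^2} \le \frac{\epsilon^2}{2\beta_0^2\, m} \to 0 .
```

In the second case the right-hand side of the bound tends to 2, so the statement becomes vacuous: a stability constant that does not shrink with m is not enough.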

To keep matters simple, we only prove (12.6). The details for the proof of (12.5), 
which is rather technical, can be found in [160]. 


Proof We first give a bound on the expected difference between $R_{\mathrm{emp}}[f_Z]$ and 
$R[f_Z]$ (hence the bias term) and subsequently will bound the variance. This leads 
to 

$$\left| \mathbf{E}_Z \left[ R_{\mathrm{emp}}[f_Z] - R[f_Z] \right] \right| = \left| \mathbf{E}_{Z,z} \left[ \frac{1}{m} \sum_{i=1}^m c(x_i, y_i, f_Z(x_i)) - c(x, y, f_Z(x)) \right] \right| \tag{12.7}$$

$$= \left| \mathbf{E}_{Z,z} \left[ \frac{1}{m} \sum_{i=1}^m c(x, y, f_{Z^i}(x)) - c(x, y, f_Z(x)) \right] \right| \le \beta. \tag{12.8}$$

The last equality (12.8) followed from the fact that, since we are taking the 
expectation over Z, z, we may as well replace $z_i$ by z in the terms stemming from 
the empirical error. The bound then follows from the assumption that we have a 
uniformly $\beta$-stable algorithm. 

Now that we have a bound on the expectation, we deal with the variance. 
Since we want to apply Theorem 12.1, we have to analyze the deviations of 
$(R_{\mathrm{emp}}[f_Z] - R[f_Z])$ from $(R_{\mathrm{emp}}[f_{Z^i}] - R[f_{Z^i}])$. 

$$\left| \left( R_{\mathrm{emp}}[f_Z] - R[f_Z] \right) - \left( R_{\mathrm{emp}}[f_{Z^i}] - R[f_{Z^i}] \right) \right| \le \tag{12.9}$$

$$\left| R[f_Z] - R[f_{Z^i}] \right| + \left| R_{\mathrm{emp}}[f_Z] - R_{\mathrm{emp}}[f_{Z^i}] \right| \le \tag{12.10}$$

$$\beta + \frac{1}{m} \left| c(x_i, y_i, f_Z(x_i)) - c(x, y, f_{Z^i}(x)) \right| + \frac{1}{m} \sum_{j \ne i} \left| c(x_j, y_j, f_Z(x_j)) - c(x_j, y_j, f_{Z^i}(x_j)) \right| \le \beta + \frac{2M}{m} + \beta. \tag{12.11}$$

Here (12.10) follows from the triangle inequality and the fact that the learning 
algorithm is $\beta$-stable. Finally, we split the empirical risks into their common parts 
depending on $Z^i$ and the remainder. From (12.11) it follows that $c_i = 2\left(\beta + \frac{M}{m}\right)$, as 
required by Theorem 12.1. This, in combination with (12.8), completes the proof. ∎ 



12.1.3 Uniform Stability of Regularization Networks 


We next show that the learning algorithms we have been studying so far actually 
satisfy Definition 12.2 and compute the corresponding value of $\beta$. 


Theorem 12.4 (Algorithmic Stability of Risk Minimizers) The algorithm 
minimizing the regularized risk functional $R_{\mathrm{reg}}$, 

$$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2} \|f\|^2 = \frac{1}{m} \sum_{i=1}^m c(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2, \tag{12.12}$$

has stability $\beta = \frac{2 C^2 \kappa^2}{m \lambda}$, where $\kappa$ is a bound on $\|k(x, \cdot)\| = \sqrt{k(x, x)}$, c is a convex loss 
function, $\|\cdot\|$ is the RKHS norm induced by k, and C is a bound on the Lipschitz constant 
of the loss function c(x, y, f(x)), viewed as a function of f(x). 


Since the proof is somewhat technical we relegate it to Section 12.1.4. Let us now 
discuss the implications of the theorem. 

We can see that the stability of the algorithm depends on the regularization 
constant via the product $\lambda m$, hence we may be able to afford to choose weaker regularization if 
the sample size increases. For many estimators, such as Support Vector Machines, 
we use a constant value of $C = \frac{1}{\lambda m}$ (the SV regularization constant, not to be confused 
with the Lipschitz bound C of Theorem 12.4). In the context of algorithmic stability this means 
that we effectively use algorithms with the same stability, regardless of the sample 
size. 
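The last statement follows in one line. To avoid the clash of symbols, the Lipschitz bound of Theorem 12.4 is written $L$ here (a renaming made only for this illustration), the stability is taken to be $\beta = 2L^2\kappa^2/(m\lambda)$, and $C$ denotes the SV constant, assumed to equal $1/(\lambda m)$.

```latex
\beta = \frac{2 L^2 \kappa^2}{m \lambda}
      = 2 L^2 \kappa^2 \cdot \frac{1}{\lambda m}
      = 2 L^2 \kappa^2 \, C ,
\qquad\text{so for fixed } C \text{ the stability } \beta \text{ does not depend on } m .
```

Keeping C fixed therefore means letting $\lambda$ shrink like $1/m$, which is exactly the weaker regularization at larger sample sizes mentioned above.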

The next step is to substitute the values of $\beta$ into (12.6) to obtain practically 
useful uniform convergence bounds (to keep matters simple we will only use 
(12.6) of Theorem 12.3). It is straightforward to obtain corresponding statements 
for (12.5) (see Problem 12.1). The following theorem is a direct consequence of 
Theorems 12.3 and 12.4. 


Theorem 12.5 (Uniform Convergence Bounds for RKHS) Given an algorithm 
minimizing the regularized risk functional, as in (12.12), with the assumptions of Theorem 
12.4 we obtain 

$$P\left\{ \left| R_{\mathrm{emp}}[f_Z] - R[f_Z] \right| > \epsilon + \beta \right\} \le 2 \exp\left( - \frac{m}{2} \left( \frac{\epsilon}{M} \right)^2 \left( 1 + \frac{2 (C\kappa)^2}{\lambda M} \right)^{-2} \right), \tag{12.13}$$

where $\beta = \frac{2 C^2 \kappa^2}{m \lambda}$. The $\frac{\epsilon}{M}$ term in the exponent stems from the fact that, in order 
to make a statement concerning a certain precision $\epsilon$, we have to take the total 
scale M of the function values and loss functions into account. The $(C\kappa)^2$ term 
determines the effective dynamic range of the function class, namely by how much 
the functions may change. $\lambda M$ specifies the effective regularization strength, that is, how 
much simple functions are preferred with respect to the full range M of the loss 
function c. 

Finally, we can see that for fixed $\lambda$ the rate of convergence of the empirical risk 
$R_{\mathrm{emp}}[f_Z]$ to $R[f_Z]$ is given by $\exp(-c_0 m)$, which is identical to the rates given by 
Hoeffding’s bound (5.7). Note, however, that in (5.7) the constant is 2, whereas 
in the present case $c_0$ may be significantly smaller. Thus, it seems as if the 
current bounds are much worse than the classical VC-type bounds (see Section 5.5) 
which essentially scale with $\mathcal{N}(\mathcal{F}, m, \epsilon) \exp(-2m\epsilon^2 M^{-2})$ [574]. Yet this is not true in 
general; usually the covering number $\mathcal{N}(\mathcal{F}, m, \epsilon)$ (see Section 12.4 for more details) 
grows with the sample size m which counteracts the effect of the exponential term, 
$\exp(-2m\epsilon^2 M^{-2})$. 

Furthermore, for practical considerations, (12.13) may be very useful, even if 
the rates are not optimal, since the bound is predictive even for small sample 
sizes and moderate regularization strength. Still, we expect that the constants 
could be improved. In particular, instead of $\frac{m}{2}$, we suspect that 2m would be more 
appropriate. 

In addition, note that the current bound uses a result concerning worst case 
stability. For most observations such as non-Support Vectors, however, changing a 
single pattern is unlikely to change the estimate at all. We suspect that the average 
stability of the estimator is much higher than what has been shown so far. It is (still) 
an open problem how this goal can be achieved. Finally, the bound does not take 
the specific form of the RKHS $\mathcal{H}$ into account, but instead assumes that the space 
is completely spherical. This leaves room for further theoretical improvement of 
the bound. Let us proceed to a practical example. 


Corollary 12.6 (Gaussian Kernel SV Regression with $\epsilon$-loss) In the case of SV 
Regression with Gaussian RBF kernels and the $\epsilon$-insensitive loss (1.44) the risk of deviation 
between empirical and expected risk is given by 

$$P\left\{ \left| R_{\mathrm{emp}}[f_Z] - R[f_Z] \right| > \epsilon + \beta \right\} \le 2 \exp\left( - \frac{m}{2} \left( \frac{\epsilon}{M} \right)^2 \left( 1 + \frac{2}{\lambda M} \right)^{-2} \right), \tag{12.14}$$

where M denotes an upper bound on the loss and $\lambda$ is a regularization parameter. 
This is since $\kappa = 1$ for Gaussian RBF kernels (here k(x, x) = 1) and, further, the loss 
function (1.44) has bounded slope 1, thus also C = 1. 
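To get a feeling for the numbers, the bounds can be evaluated directly. The sketch below assumes the readings $\beta = 2C^2\kappa^2/(m\lambda)$ for Theorem 12.4 and $2\exp(-m\epsilon^2/(2(m\beta+M)^2))$ for (12.6), and checks that substituting one into the other reproduces the closed form of (12.13) and (12.14). The function names and the parameter values are illustrative assumptions.

```python
import math

def stability_beta(lipschitz, kappa, m, lam):
    """Stability constant, read as beta = 2 C^2 kappa^2 / (m lambda)."""
    return 2.0 * lipschitz**2 * kappa**2 / (m * lam)

def deviation_bound(eps, m, M, beta):
    """Right-hand side of (12.6) for a beta-stable algorithm."""
    return 2.0 * math.exp(-m * eps**2 / (2.0 * (m * beta + M) ** 2))

def rkhs_bound(eps, m, M, lam, lipschitz=1.0, kappa=1.0):
    """Right-hand side of (12.13); the defaults give the Gaussian RBF case (12.14)."""
    scale = 1.0 + 2.0 * (lipschitz * kappa) ** 2 / (lam * M)
    return 2.0 * math.exp(-(m / 2.0) * (eps / M) ** 2 / scale**2)

m, M, lam, eps = 10000, 1.0, 1.0, 0.1      # illustrative values only
beta = stability_beta(1.0, 1.0, m, lam)
via_12_6 = deviation_bound(eps, m, M, beta)
via_12_14 = rkhs_bound(eps, m, M, lam)
```

Both routes give the same value, since $m\beta + M = M\,(1 + 2(C\kappa)^2/(\lambda M))$. Weaker regularization (smaller `lam`) inflates the second factor and quickly makes the bound uninformative.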
12.1.4 Proof of Theorem 12.4 


We require an auxiliary lemma. 


Lemma 12.7 (Convex Functions and Derivatives) For any differentiable convex 
function $f : \mathbb{R} \to \mathbb{R}$ and any $a, b \in \mathbb{R}$ we have 

$$\left( f'(a) - f'(b) \right) (a - b) \ge 0. \tag{12.15}$$


Proof Due to the convexity of f we know that $f(a) + (b - a) f'(a) \le f(b)$ and, 
likewise, $f(b) + (a - b) f'(b) \le f(a)$. Summing up both inequalities and subtracting 
the terms in f(a) and f(b) proves (12.15). ∎ 
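Written out, the summation step of this proof reads as follows; the quadratic $f(t) = t^2$ is added here only as a sanity check.

```latex
f(a) + f(b) + (b - a) f'(a) + (a - b) f'(b) \le f(a) + f(b)
\;\Longrightarrow\;
-(a - b)\bigl(f'(a) - f'(b)\bigr) \le 0
\;\Longrightarrow\;
\bigl(f'(a) - f'(b)\bigr)(a - b) \ge 0 .
\qquad
\text{For } f(t) = t^2:\quad (2a - 2b)(a - b) = 2(a - b)^2 \ge 0 .
```

In words, the lemma states that the derivative of a differentiable convex function is nondecreasing.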


Proof of Theorem 12.4. We must extend the notation slightly insofar as we will 
explicitly introduce the dependency on the data in the empirical risk functional. 
This simply means that, instead of writing $R_{\mathrm{reg}}[f]$, we will use $R_{\mathrm{reg}}[f, Z]$ and 
$R_{\mathrm{reg}}[f, Z^i]$ (and likewise $R_{\mathrm{emp}}[f, Z]$) for the remainder of the proof in order to 
distinguish between different training sets. 
Recall that $f_Z$ minimizes $R_{\mathrm{reg}}[f, Z]$, that is, the functional derivative of $R_{\mathrm{reg}}[f, Z]$ 
at $f_Z$ vanishes (and likewise for $f_{Z^i}$ and $Z^i$), 

$$\partial_f R_{\mathrm{reg}}[f_Z, Z] = \partial_f R_{\mathrm{emp}}[f_Z, Z] + \lambda f_Z = 0, \tag{12.16}$$

$$\partial_f R_{\mathrm{reg}}[f_{Z^i}, Z^i] = \partial_f R_{\mathrm{emp}}[f_{Z^i}, Z^i] + \lambda f_{Z^i} = 0. \tag{12.17}$$

Next, we construct an auxiliary risk function R[f] by 

$$R[f] := \left\langle \partial_f R_{\mathrm{emp}}[f_Z, Z] - \partial_f R_{\mathrm{emp}}[f_{Z^i}, Z^i], \, f - f_{Z^i} \right\rangle + \frac{\lambda}{2} \| f - f_{Z^i} \|^2. \tag{12.18}$$

Clearly R[f] is a convex function in f (the first term is linear, the second quadratic). 
Additionally, by construction 

$$R[f_{Z^i}] = 0. \tag{12.19}$$

Furthermore, the minimum of R[f] is obtained for $f = f_Z$. One can see this by 
taking the functional derivative of R[f] to find 

$$\partial_f R[f] = \partial_f R_{\mathrm{emp}}[f_Z, Z] - \partial_f R_{\mathrm{emp}}[f_{Z^i}, Z^i] + \lambda (f - f_{Z^i}) = \partial_f R_{\mathrm{emp}}[f_Z, Z] + \lambda f, \tag{12.20}$$

where the second equality uses (12.17). Eq. (12.20) vanishes for $f = f_Z$ due to (12.16). 
From (12.19) we therefore conclude 
that $R[f_Z] \le 0$. In order to obtain bounds on $\| f_Z - f_{Z^i} \|$, we have to get rid of some 
of the first terms in R[f]. We observe 

$$m \left\langle \partial_f R_{\mathrm{emp}}[f_Z, Z] - \partial_f R_{\mathrm{emp}}[f_{Z^i}, Z^i], \, f_Z - f_{Z^i} \right\rangle \tag{12.21}$$

$$= \sum_{j \ne i} \left( c'(x_j, y_j, f_Z(x_j)) - c'(x_j, y_j, f_{Z^i}(x_j)) \right) \left( f_Z(x_j) - f_{Z^i}(x_j) \right) + c'(x_i, y_i, f_Z(x_i)) \left( f_Z(x_i) - f_{Z^i}(x_i) \right) - c'(x, y, f_{Z^i}(x)) \left( f_Z(x) - f_{Z^i}(x) \right) \tag{12.22}$$

$$\ge c'(x_i, y_i, f_Z(x_i)) \left( f_Z(x_i) - f_{Z^i}(x_i) \right) - c'(x, y, f_{Z^i}(x)) \left( f_Z(x) - f_{Z^i}(x) \right). \tag{12.23}$$

In order to obtain (12.22) we use the same techniques as exploited in the proof 
of Theorem 4.2, in particular that $\partial_f c(x, y, f(x)) = c'(x, y, f(x)) k(x, \cdot)$. Collecting 
common terms between $R_{\mathrm{emp}}[f, Z]$ and $R_{\mathrm{emp}}[f, Z^i]$ leads to the result. For (12.23) 
we use Lemma 12.7 applied to the loss function c(x, y, f(x)) which is a convex 
function of f(x). Combining (12.23) with $R[f_Z] \le 0$ gives 

$$0 \ge c'(x_i, y_i, f_Z(x_i)) \left( f_Z(x_i) - f_{Z^i}(x_i) \right) - c'(x, y, f_{Z^i}(x)) \left( f_Z(x) - f_{Z^i}(x) \right) + \frac{m\lambda}{2} \| f_Z - f_{Z^i} \|^2. \tag{12.24}$$

Since the norm of the derivative of the loss function $|c'(x, y, f(x))|$ is bounded by C 
and $|f_Z(x)|, |f_{Z^i}(x)| \le M$ we have 

$$\frac{m\lambda}{2} \| f_Z - f_{Z^i} \|^2 \le c'(x, y, f_{Z^i}(x)) \left( f_Z(x) - f_{Z^i}(x) \right) - c'(x_i, y_i, f_Z(x_i)) \left( f_Z(x_i) - f_{Z^i}(x_i) \right). \tag{12.25}$$

Finally, we must convert our result regarding the proximity of $f_Z$ and $f_{Z^i}$ in $\mathcal{H}$ 
to a statement regarding the corresponding values of the loss functions. By the 
Cauchy-Schwarz inequality we can see that, for any $f, f' \in \mathcal{H}$ and any $x \in \mathcal{X}$, 

$$|f(x) - f'(x)| = \left| \langle f - f', k(x, \cdot) \rangle \right| \le \| f - f' \| \, \| k(x, \cdot) \| \le \kappa \| f - f' \|. \tag{12.26}$$

Since c is Lipschitz continuous this leads to 

$$\left| c(x, y, f_Z(x)) - c(x, y, f_{Z^i}(x)) \right| \le C \left| f_Z(x) - f_{Z^i}(x) \right| \le C \kappa \| f_Z - f_{Z^i} \|. \tag{12.27}$$

Using (12.27) in (12.25) yields 

$$\frac{m\lambda}{2} \| f_Z - f_{Z^i} \|^2 \le C \kappa \| f_Z - f_{Z^i} \| \tag{12.28}$$

and, thus, $\| f_Z - f_{Z^i} \| \le \frac{2 C \kappa}{m \lambda}$. Substituting this into (12.27) proves the claim by 
$\left| c(x, y, f_Z(x)) - c(x, y, f_{Z^i}(x)) \right| \le \frac{2 C^2 \kappa^2}{m \lambda}$. ∎ 

An extension to piecewise convex loss functions can be achieved by replacing the 
derivatives by subdifferentials. Most equalities for optimality conditions become 
statements about memberships in sets. The auxiliary lemma can be analogously 
extended, and the overall theorem stated in terms of an upper bound on the values 
of the subdifferentials of the loss function. Since this would clutter the notation 
even further, we have refrained from that. A second way of proving the theorem 
for arbitrary convex loss functions uses Legendre transformations of the empirical 
risk term. See [615] among others, for details. 


12.2 Leave-One-Out Estimates 


Rather than betting on the proximity between the empirical risk and the expected 
risk we may make further use of the training data and compute what is commonly 
referred to as the leave-one-out error of a sample. The basic idea is that we find an 
estimate $f^i$ from a sample consisting of m − 1 patterns by leaving the ith pattern 
out and, subsequently, compute the error of misprediction on $(x_i, y_i)$. The error is 
then averaged over all m possible patterns. The hope is that such a procedure will 
provide us with a quantity that is very closely related to the real expected error. 


12.2.1 Theoretical Background 

Before we delve into the practical details of estimating the leave-one-out error, we 
need a formal definition and have to prove that the leave-one-out estimator is a 
useful quantity. 


Definition 12.8 (Leave-One-Out Error) Denote by $f_Z$ the estimate obtained by a 
learning algorithm, given the sample Z, by $Z^i := Z \setminus \{(x_i, y_i)\}$ the sample obtained by removing 
the ith pattern, and by $f_{Z^i}$ the corresponding estimate, obtained by the same learning 
algorithm (note that we changed the definition of $Z^i$ from that in the previous section). Then 
the leave-one-out error is defined as 

$$R_{\mathrm{LOO}}(Z) := \frac{1}{m} \sum_{i=1}^m c(x_i, y_i, f_{Z^i}(x_i)). \tag{12.29}$$

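Definition 12.8 translates directly into a loop over held-out patterns. The following is a minimal sketch; the learner (linear ridge regression without offset), the squared loss, and the names `leave_one_out_error`, `fit_ridge` and `squared_loss` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def leave_one_out_error(X, y, fit, loss):
    """R_LOO(Z): average loss of the estimate trained without pattern i on (x_i, y_i)."""
    m = len(y)
    total = 0.0
    for i in range(m):
        keep = np.arange(m) != i
        f_i = fit(X[keep], y[keep])               # estimate from the sample Z^i
        total += loss(y[i], f_i(X[i:i + 1])[0])   # error on the held-out pattern
    return total / m

def fit_ridge(X, y, lam=0.1):
    """Hypothetical learner: linear ridge regression without offset."""
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda X_new: X_new @ w

def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
r_loo = leave_one_out_error(X, y, fit_ridge, squared_loss)
r_emp = np.mean(squared_loss(y, fit_ridge(X, y)(X)))   # empirical risk, for comparison
```

The loop retrains the learner m times, which is the computational cost that motivates cheaper approximations of the leave-one-out error.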
The following theorem by Luntz and Brailovsky [335] shows that $R_{\mathrm{LOO}}(Z)$ is an 
almost unbiased estimator (cf. footnote 1). 


Theorem 12.9 (Leave-One-Out Error is Almost Unbiased [335]) Denote by $P$ a distribution
over $\mathcal{X} \times \mathcal{Y}$, and by $Z_m$ and $Z_{m-1}$ samples of size $m$ and $m - 1$ respectively,
drawn iid from $P$. Moreover, denote by $R[f_{Z_{m-1}}]$ the expected risk of an estimator derived
from the sample $Z_{m-1}$. Then, for any learning algorithm, the leave-one-out error is almost
unbiased,

$$E_{Z_{m-1}}\bigl[R[f_{Z_{m-1}}]\bigr] = E_{Z_m}\bigl[R_{\mathrm{LOO}}(Z_m)\bigr]. \tag{12.30}$$


Proof We begin by rewriting $E_{Z_{m-1}}[R[f_{Z_{m-1}}]]$ in terms of expected values only. By
definition (see (3.12)) $R[f] := E[c(x, y, f(x))]$ and, therefore, the lhs of (12.30) can
be written as

$$E_{Z_{m-1}}\bigl[R[f_{Z_{m-1}}]\bigr] = E_{Z_{m-1} \cup \{(x, y)\}}\bigl[c(x, y, f_{Z_{m-1}}(x))\bigr]. \tag{12.31}$$


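The leave-one-out error, on the other hand, can be restated as

$$E_{Z_m}\bigl[R_{\mathrm{LOO}}(Z_m)\bigr] = \frac{1}{m} \sum_{i=1}^{m} E_{Z_m}\bigl[c(x_i, y_i, f_{Z_m^i}(x_i))\bigr] = E_{Z_{m-1} \cup \{(x_m, y_m)\}}\bigl[c(x_m, y_m, f_{Z_{m-1}}(x_m))\bigr]. \tag{12.32}$$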


Here we use the fact that expectation and summation can be interchanged. In 
addition, a permutation argument shows that all terms under the sum have to be 
equal, hence we can replace the average by one of the terms. Finally, if we rename 
$(x_m, y_m)$ by $(x, y)$, then (12.32) becomes identical to the rhs of (12.31) which proves
the theorem. ∎
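
To make Definition 12.8 concrete, the following minimal sketch computes the leave-one-out error by brute force, retraining $m$ times. The helper names (`leave_one_out_error`, `train_mean`, `squared_loss`) and the toy mean-predicting learner are assumptions made purely for illustration; any learning algorithm and loss function could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # one pattern (x, y)


def leave_one_out_error(
    sample: Sequence[Sample],
    train: Callable[[List[Sample]], Callable[[float], float]],
    loss: Callable[[float, float, float], float],
) -> float:
    """Average of c(x_i, y_i, f_{Z^i}(x_i)) over all m held-out patterns, cf. (12.29)."""
    m = len(sample)
    total = 0.0
    for i, (x_i, y_i) in enumerate(sample):
        reduced = [z for j, z in enumerate(sample) if j != i]  # the sample Z^i
        f_i = train(reduced)                                   # the estimate f_{Z^i}
        total += loss(x_i, y_i, f_i(x_i))
    return total / m


def train_mean(sample: List[Sample]) -> Callable[[float], float]:
    # Hypothetical toy learner: always predict the mean training label.
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean


def squared_loss(x: float, y: float, prediction: float) -> float:
    return (y - prediction) ** 2


toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)]
print(leave_one_out_error(toy, train_mean, squared_loss))  # 4.5
```

The cost of this direct approach is $m$ training runs, which is what the bounds and approximations in the following sections try to avoid.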


This demonstrates that the leave-one-out error is a sensible quantity to use. We are 
short, however, of another key ingredient required in the use of this method when 
bounding the error of an estimator; we need a bound on the variance of $R_{\mathrm{LOO}}(Z)$.
While general results exist, which show that the leave-one-out estimator is not a 
worse estimate than the estimate based on the empirical error (see Kearns [285] for 
example, who shows that at least the rate is not worse), we would expect that, on 
the contrary, the leave-one-out error is much more reliable than the empirical risk. 


1. The term “almost” refers to the fact that the leave-one-out error provides an estimate for 
training on sets of size $m - 1$ rather than $m$, cf. Proposition 7.4.
We are not aware of any such result in the context of minimizers of the regu-
larized risk functional (see [136] for an overview of bounds on the leave-one-out 
error). In the following we state a result which is a slight improvement on Theorem 
12.3 and which uses the same concentration of measure techniques as in Section 
12.1 (see also [68]). 


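Theorem 12.10 (Tail Bound for Leave-One-Out Estimators) Denote by $A$ a $\beta$-stable
algorithm (for training sets of size $m - 1$) with the additional requirement that $0 \le A(Z) \le
M$ (that is, the loss of the estimate $A(Z)$ at $z$ lies between $0$ and $M$) for all $z \in \mathcal{X} \times \mathcal{Y}$ and for all
training samples $Z \subset \mathcal{X} \times \mathcal{Y}$. Then we have

$$P\bigl\{|R_{\mathrm{LOO}}(Z) - E_Z[R_{\mathrm{LOO}}(Z)]| > \epsilon\bigr\} \le 2 \exp\left(-\frac{2 m \epsilon^2}{(m\beta + M)^2}\right). \tag{12.33}$$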


Proof The proof is very similar to that of Theorem 12.3 and uses Theorem 12.1.
All we must do is show that $R_{\mathrm{LOO}}(Z)$ does not change by too much if we replace
one of the patterns in $Z$ by a different pattern. This means that, for $Z^i := Z \setminus z_i \cup \{z\}$
(where $z := (x, y)$), we have to determine a constant $c_0$ such that

$$|R_{\mathrm{LOO}}(Z) - R_{\mathrm{LOO}}(Z^i)| \le c_0 \text{ for all } i. \tag{12.34}$$


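In the following we denote by $f_Z^j$ (and $f_{Z^i}^j$ respectively) the estimate obtained when
leaving out the $j$th pattern. We may now expand (12.34) as follows.

$$|R_{\mathrm{LOO}}(Z) - R_{\mathrm{LOO}}(Z^i)| \le \frac{1}{m} \sum_{j \ne i} \bigl|c(x_j, y_j, f_Z^j(x_j)) - c(x_j, y_j, f_{Z^i}^j(x_j))\bigr| + \frac{1}{m} \bigl|c(x_i, y_i, f_Z^i(x_i)) - c(x, y, f_{Z^i}^i(x))\bigr| \tag{12.35}$$

$$\le \frac{1}{m} \sum_{j \ne i} \beta + \frac{M}{m}. \tag{12.36}$$
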
In (12.36) we use the fact that we have a $\beta$-stable algorithm, hence the individual
summands are bounded by $\beta$. In addition, the loss at arbitrary locations is
bounded from above by $M$ (and by $0$ from below), hence two losses may not differ
by more than $M$ overall. This shows that $c_0 \le \beta + \frac{M}{m}$. Substituting this into (12.2)
proves the bound. ∎
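
As a small numerical illustration of how the constant $c_0 \le \beta + M/m$ enters the tail bound, the sketch below evaluates $2\exp(-2\epsilon^2 / (m c_0^2))$. This assumes that Theorem 12.1 has the usual McDiarmid form with one bounded-difference constant $c_0$ per pattern; the function name `loo_tail_bound` is hypothetical.

```python
import math


def loo_tail_bound(m: int, beta: float, M: float, eps: float) -> float:
    """Upper bound on P{|R_LOO(Z) - E[R_LOO(Z)]| > eps} for a beta-stable algorithm.

    Assumes McDiarmid's inequality in the form 2 exp(-2 eps^2 / (m c0^2)),
    with c0 = beta + M / m as derived in the proof above.
    """
    c0 = beta + M / m
    return 2.0 * math.exp(-2.0 * eps ** 2 / (m * c0 ** 2))


# Example: m = 100 patterns, beta = 0.01, losses in [0, 1], deviation eps = 0.5.
print(loo_tail_bound(100, 0.01, 1.0, 0.5))
```

Note that the bound is only informative when $\beta$ decays with $m$; for constant $\beta$ the term $m c_0^2$ grows linearly in $m$ and the bound degrades.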


We may use the values of $\beta$ computed for minimizers of the regularized risk
functional (Theorem 12.4 and Corollary 12.6) in order to obtain practical bounds.
The current result is an improvement on the confidence bounds available for
minimizers of the regularized risk functional (there is no dependency on $\beta$ in the
confidence bound and the constants in the exponential term are slightly better).
One would, however, suspect that much better bounds should be possible.

In particular, rather than bounding each individual term in (12.35) by $\beta$, it
should be possible to take advantage of averaging effects and, thus, replace the
overall bound $\beta$ by $\frac{\beta}{\sqrt{m}}$ for example. It is an open question whether such a bound
can be obtained.
Also note that Theorem 12.10 only applies to Lipschitz continuous, convex loss
functions. This means that we cannot use the bound in the case of classification, 
since the loss function is discontinuous (we have 0-1 loss). Still, the leave-one- 
out error turns out to be currently the most reliable estimator of the expected error 
available. Hence, despite some lack of theoretical justification, one should consider 
it a method of choice when performing model selection for kernel methods. 

This brings us to another problem: how should we compute the leave-one-out
error efficiently, without running a training algorithm $m$ times? We must find
approximations or good upper bounds for $R_{\mathrm{LOO}}(Z)$ which are cheap to compute.


12.2.2 Lagrange Multiplier Estimates 


One of the most intuitive and simplest bounds on the leave-one-out error to 
compute is one based on the values of the Lagrange multipliers [259]. In its original 
form it was proven for regularized risk functionals without constant threshold $b$,
that is, for $f \in \mathcal{H}$. It can be stated as follows.


Theorem 12.11 (Jaakkola and Haussler [259]) Denote by $Z \subset \mathcal{X} \times \mathcal{Y}$ a training set
for classification, and by $f(x) = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x)$ the minimizer of the regularized risk
functional (7.35). Then an upper bound on the leave-one-out error is

$$R_{\mathrm{LOO}}(Z) \le \frac{1}{m} \sum_{i=1}^{m} \theta\bigl(-y_i f(x_i) + \alpha_i k(x_i, x_i)\bigr), \tag{12.37}$$

where $\theta$ is the step function.


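Proof Recall (7.37). This is the special case of a convex optimization problem of
the following form;

$$\begin{aligned} \text{minimize} \quad & D(\alpha) := \sum_{i=1}^{m} \gamma(\alpha_i) + \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\ \text{subject to} \quad & 0 \le \alpha_i \le C \text{ for all } i \in [m]. \end{aligned} \tag{12.38}$$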


Here $\gamma(\alpha_i)$ is the term stemming from the loss function (in SV classification $\gamma(\alpha_i) =
-\alpha_i$). Moreover, denote the restriction of $D$ to the set $Z \setminus \{(x_i, y_i)\}$ by $D^i(\alpha)$. Denote
by $\alpha^* \in \mathbb{R}^m$ the minimizer of (12.38) and by $\alpha^i \in \mathbb{R}^{m-1}$ the minimizer of the
corresponding problem in $D^i$. Finally, denote the restriction of $\alpha^*$ onto $\mathbb{R}^{m-1}$ obtained
by removing the $i$th component by $\bar{\alpha}^*$ (and likewise $f^*$, $f^i$, $\bar{f}^*$).

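By construction $D^i(\alpha^i) \le D^i(\bar{\alpha}^*)$. We modify $D^i$ slightly such that the changed
version has $\bar{\alpha}^*$ as its minimum. One may check that

$$\tilde{D}^i(\alpha) := D^i(\alpha) + y_i \alpha_i^* \sum_{j \ne i} y_j \alpha_j k(x_i, x_j) = D^i(\alpha) + y_i \alpha_i^* \bigl(f(x_i) - \alpha_i y_i k(x_i, x_i)\bigr) \tag{12.39}$$

satisfies this property. Thus, we have $\tilde{D}^i(\alpha^i) \ge \tilde{D}^i(\bar{\alpha}^*)$. Expanding terms we obtain

$$\tilde{D}^i(\bar{\alpha}^*) - D^i(\bar{\alpha}^*) \le \tilde{D}^i(\alpha^i) - D^i(\alpha^i) = y_i \alpha_i^* \sum_{j=1, j \ne i}^{m} y_j \alpha_j^i k(x_i, x_j) = y_i \alpha_i^* f^i(x_i). \tag{12.40}$$

Since $\alpha_i^* \ge 0$ for all $i \in [m]$ this implies

$$y_i f(x_i) - \alpha_i^* k(x_i, x_i) \le y_i f^i(x_i). \tag{12.41}$$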


As a leave-one-out error occurs exactly if $y_i f^i(x_i) < 0$ this means that $y_i f(x_i) -
\alpha_i^* k(x_i, x_i) < 0$ is a necessary condition for a leave-one-out error and, thus, can be
used as an upper bound on it. ∎
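
The bound of Theorem 12.11 is cheap to evaluate once the expansion coefficients are known. The sketch below counts the patterns for which the necessary condition $y_i f(x_i) - \alpha_i k(x_i, x_i) < 0$ holds. It assumes a kernel expansion without threshold $b$, a precomputed kernel matrix `K` given as nested lists, and a hypothetical function name `jaakkola_haussler_bound`.

```python
from typing import Sequence


def jaakkola_haussler_bound(
    K: Sequence[Sequence[float]],
    y: Sequence[int],
    alpha: Sequence[float],
) -> float:
    """Fraction of patterns with y_i f(x_i) - alpha_i k(x_i, x_i) < 0, cf. (12.37)."""
    m = len(y)
    count = 0
    for i in range(m):
        # f(x_i) = sum_j alpha_j y_j k(x_j, x_i), with no constant threshold b
        f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(m))
        if y[i] * f_i - alpha[i] * K[i][i] < 0:
            count += 1
    return count / m


K = [[1.0, 0.5], [0.5, 1.0]]
print(jaakkola_haussler_bound(K, [1, -1], [1.0, 1.0]))  # 1.0
print(jaakkola_haussler_bound(K, [1, 1], [1.0, 1.0]))   # 0.0
```

In the first call the two patterns have opposite labels and rely entirely on their own coefficients, so both are flagged; in the second call each pattern is supported by the other and none is flagged.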


Note that we cannot directly apply this bound to the classical SV setting since
there $f(x) = \langle w, \Phi(x) \rangle + b$; specifically, there exists a parametric constant term
$b$ which is not regularized at all. Joachims [268] shows that Theorem 12.11 can
be modified so as to accommodate this setting, namely by replacing
$f(x_i) - \alpha_i k(x_i, x_i)$ by $f(x_i) - 2\alpha_i k(x_i, x_i)$. Practical experiments show that this bound
is overly conservative [268]. In fact, for all practical purposes, Theorem 12.11 seems
to be predictive enough, even in the case of a constant threshold.

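Additionally, (12.37) motivates a modified SV training algorithm [592] by directly
minimizing the bound of Theorem 12.11,

$$\begin{aligned} \text{minimize} \quad & \sum_{i=1}^{m} \xi_i \\ \text{subject to} \quad & y_i \sum_{j \ne i} \alpha_j y_j k(x_i, x_j) \ge 1 - \xi_i \text{ for all } i \in [m] \\ & \alpha_i, \xi_i \ge 0 \text{ for all } i \in [m]. \end{aligned} \tag{12.42}$$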


Here we choose a fixed constant for the margin to ensure non-zero solutions. It 
appears that an algorithm which minimizes (12.42) does not have any free pa- 
rameters. Unfortunately this is not quite true. Weston and Herbrich [592] modify 
(12.42) to regularize the setting by replacing the inequality constraint by

$$y_i \Bigl(\sum_{j \ne i} \alpha_j y_j k(x_i, x_j) + \lambda \alpha_i y_i k(x_i, x_i)\Bigr) \ge 1 - \xi_i \text{ for all } i \in [m] \tag{12.43}$$

where $\lambda \in [0, 1]$. Unfortunately the effect of $\lambda$ is not so easily understood. See the
original work [592] for more details, or Problem 12.2 for an alternative approach
to fixing the parametrization problem via the $\nu$-trick. Let us now proceed to a
more accurate bound on the leave-one-out error, this time computed by using the 
distribution of the Support Vectors. 


12.2.3 The Span Bound for Classification 


Vapnik and Chapelle [565] use a quantity, called the span of the Support Vectors, 
in order to provide an upper bound on the number of errors incurred by an SVM.
The proof technique is quite similar to the one in the previous section. The main
difference is that the Lagrange multipliers are adapted in such a way as to accommodate
the constant threshold and the fact that only a subset of patterns are chosen as
Support Vectors. Consequently, we obtain a more complicated (and possibly more 
precise) bound on the leave-one-out error. 

In what follows, we state the main theorems from [565] without proof (see the
original publication for details). In addition, we adapt the results to quantile
estimation and novelty detection (see Chapter 8 for the description of the algorithm)
and prove the latter (the proofs being similar, one can easily infer the original from
the modification). Before we proceed, we recall the notion of in-bound Support
Vectors. As introduced in Table 7.2, these are the SVs $x_i$ whose Lagrange multipliers
lie in the interior of the box constraints; using the present notation, $0 < \alpha_i < C$.


Definition 12.12 (Span of an SV Classification Solution) Denote by $X, Y$ the training
sample and by $\alpha_1, \ldots, \alpha_m, b$ the solution obtained by solving the corresponding SV
soft-margin classification problem with upper bound $C$ on the Lagrange multipliers (see
Section 1.5 and Chapter 7). Furthermore, denote by $\alpha_1, \ldots, \alpha_n$ the in-bound SVs (without
loss of generality we assume that they are the first $n$ patterns of the sample). Then the
span $S_{\mathrm{class}}(l)$ of an SV classification solution with respect to the pattern $x_l$ is defined by

$$S^2_{\mathrm{class}}(l) := \min_{\beta \in B_l} \sum_{i,j=1}^{n} \beta_i \beta_j k(x_i, x_j), \tag{12.44}$$

where

$$B_l := \Bigl\{\beta \Bigm| \beta_l = -1, \ \sum_{i=1}^{n} \beta_i = 0, \text{ and } 0 \le \alpha_i + y_i y_l \alpha_l \beta_i \le C\Bigr\}. \tag{12.45}$$



Note that $\sum_{i,j} \beta_i \beta_j k(x_i, x_j) = \bigl\|\sum_i \beta_i \Phi(x_i)\bigr\|^2$ and, in particular, that $S^2_{\mathrm{class}}(l)$ is
the minimum squared distance between the pattern $\Phi(x_l)$ and a linear combination of
the remaining in-bound Support Vectors which leaves the box and inequality
constraints intact.² This is a measure for how well $\Phi(x_l)$ can be replaced by the
remaining in-bound SVs.

The first thing to show is that (12.44) is actually well defined, that is, the set $B_l$ is
nonempty. The following lemma tells us this and gives an upper bound on $S_{\mathrm{class}}(l)$.


Lemma 12.13 (The Span is Well Defined [565]) The quantity $S^2_{\mathrm{class}}(l)$ is well defined,
in particular, the set $B_l$ is nonempty and, further,

$$S_{\mathrm{class}}(l) \le D_{SV} \tag{12.46}$$

where $D_{SV}$ is the diameter of the smallest sphere containing the in-bound Support Vectors.


After these definitions we have to put our results to practical use. The two key 
relevant bounds are now given. 


Theorem 12.14 (Misclassification Bound via the Span [565]) If in the leave-one-out
procedure an in-bound Support Vector $x_l$ is recognized incorrectly, then the following
inequality holds:

$$\alpha_l S_{\mathrm{class}}(l) \max\left(D, \frac{1}{\sqrt{C}}\right) \ge 1. \tag{12.47}$$


2. Vapnik and Chapelle [565] actually define a geometrical object which they call the span
of a set of Support Vectors and then compute its distance to a pattern $\Phi(x_l)$.
Additionally, if the sets of SVs, and of in-bound SVs, remain the same during the
leave-one-out procedure, then, for any SV $x_l$ the following equality holds,

$$y_l \bigl(f(x_l) - f^l(x_l)\bigr) = \alpha_l S^2_{\mathrm{class}}(l). \tag{12.48}$$

Here $f^l$ is the minimizer of the SV classification problem in which $x_l$ has been removed.


This means that we may use (12.47) as an upper bound on the size of the
misclassification error. Furthermore, (12.48) can be employed as an approximation
of the leave-one-out error, simply by counting the number of instances where
$y_l f(x_l) < \alpha_l S^2_{\mathrm{class}}(l)$. This is, of course, not a bound on the leave-one-out error
any more, since we can, in general, not expect that the remaining Support Vectors
will not change if the pattern $x_l$ is removed. Yet it may be a more reliable estimate
in practice. Note the similarity to the result of Jaakkola and Haussler [259] of the
previous section; there, $S^2_{\mathrm{class}}(l)$ is replaced by $k(x_l, x_l)$. We conclude this section
by noting that the span bound for the $\nu$-SVM has been studied in detail in [217].
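
The counting rule based on (12.48) can be sketched as follows. The code assumes that the margins $y_l f(x_l)$, the multipliers $\alpha_l$, and the squared spans $S^2_{\mathrm{class}}(l)$ have already been computed elsewhere (the latter by solving (12.44)); the function name `span_loo_estimate` is hypothetical.

```python
from typing import Sequence


def span_loo_estimate(
    margins: Sequence[float],   # margins[l] = y_l * f(x_l)
    alpha: Sequence[float],     # Lagrange multipliers alpha_l
    span_sq: Sequence[float],   # squared spans S^2_class(l)
) -> float:
    """Fraction of patterns with y_l f(x_l) < alpha_l S^2_class(l)."""
    m = len(margins)
    errors = sum(1 for l in range(m) if margins[l] < alpha[l] * span_sq[l])
    return errors / m


# Three patterns; the last one is not a Support Vector (alpha = 0).
print(span_loo_estimate([2.0, 0.5, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 5.0]))
```

Replacing `span_sq[l]` by the kernel diagonal $k(x_l, x_l)$ gives the Jaakkola and Haussler estimate of the previous section.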


12.2.4 The Span Bound for Quantile Estimation 


Let us briefly review the approach of Chapter 8. A key feature was the integration
of a single class SVM with the $\nu$-trick, namely the fact that we may specify a certain
fraction of patterns to lie beyond the hyperplane beforehand. In particular recall
(8.6) and, subsequently, the dual problem (8.13) with constraints (8.14) and (8.15).

In this case the constraints on the Lagrange multipliers are given by $0 \le \alpha_i \le \frac{1}{\nu m}$.
This setting, however, creates a problem in the case of a leave-one-out procedure:
should we adjust $\frac{1}{\nu m}$ to $\frac{1}{\nu (m-1)}$ or keep the original upper constraint and simply
remove the variable corresponding to the pattern $x_i$? The first case is more faithful
to the concept of leave-one-out error testing. The second, as we shall see, is much
more amenable to the proof of practical bounds. In addition, keeping the original
constraints can be seen as a replacement of $\nu$ by $\nu' = \nu \bigl(1 + \frac{1}{m-1}\bigr)$, since then
$\frac{1}{\nu'(m-1)} = \frac{1}{\nu m}$. This means that the
threshold $\nu$ is slightly increased for leave-one-out training. Therefore we can expect
that the number of leave-one-out errors committed in the case of an estimator
trained with $\nu'$ will be larger than for one trained with $\nu$. Further, for large $m$, this
change is negligible. We begin with a definition of the span.


Definition 12.15 (Span/Swap of an SV Quantile Estimation Solution) Denote by
$X$ the training sample and by $\alpha_1, \ldots, \alpha_m, \rho$ the solution obtained by solving the
corresponding quantile estimation problem with corresponding parameter $\nu$. Moreover, denote
by $\alpha_1, \ldots, \alpha_n$ the in-bound SVs. Then the span $S_{\mathrm{support}}(l)$ with respect to the pattern $x_l$ is
defined as follows.

■ If the number of SVs (in-bound or not) $n^*$ is bounded from below by $n^* - 1 \ge \nu m$ we
define the span as

$$S^2_{\mathrm{support}}(l) := \min_{\beta \in B_l} \sum_{i,j=1}^{n} \beta_i \beta_j k(x_i, x_j), \tag{12.49}$$

where

$$B_l := \Bigl\{\beta \Bigm| \beta_l = -1, \ \sum_{i=1}^{n} \beta_i = 0, \text{ and } 0 \le \alpha_i + \alpha_l \beta_i \le \frac{1}{\nu m}\Bigr\}. \tag{12.50}$$

■ If the number of SVs is given by $n^* = \lceil \nu m \rceil$ we define the swap of a Support Vector by

$$\mathrm{Swap}^2(l) := \min_{j:\, \alpha_j = 0} \Bigl[\frac{\alpha_l^2}{2} (K_{ll} + K_{jj} - 2K_{lj}) + \alpha_l \Bigl(\sum_{i=1}^{m} \alpha_i K_{ij} - \rho\Bigr)\Bigr], \tag{12.51}$$

where as usual we use the kernel matrix $K_{ij} := k(x_i, x_j)$.


We do not have to consider the case $n^* < \lceil \nu m \rceil$ since, according to Proposition 8.3,
$n^* \ge \nu m$. We next show that $S_{\mathrm{support}}(l)$ is well defined and compute a bound on it.


Lemma 12.16 (The Span is Well Defined) The quantity $S_{\mathrm{support}}(l)$ is well defined; the
set $B_l$ is nonempty and, moreover,

$$S_{\mathrm{support}}(l) \le D_{SV} \tag{12.52}$$

where $D_{SV}$ is the diameter of the smallest sphere containing the in-bound Support Vectors.


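Proof For $n^* = \lceil \nu m \rceil$ it is clear that $\mathrm{Swap}(l)$ is well defined. For $n^* - 1 \ge \nu m$ we
have to show that a set of $\beta_i$ (with $i \ne l$) exists such that

$$\sum_{i=1, i \ne l}^{n} \beta_i = 1 \text{ where } 0 \le \alpha_i + \alpha_l \beta_i \le \frac{1}{\nu m} \text{ and } \beta_i \ge 0 \tag{12.53}$$

since in this case $S^2_{\mathrm{support}}(l) \le \|\Phi(x_l) - \bar{\Phi}\|^2$ where $\bar{\Phi} = \sum_{i \ne l} \beta_i \Phi(x_i)$ is an
element of the convex hull of the in-bound Support Vectors, and thus the diameter
of the corresponding sphere $D_{SV}$ is an upper bound.

Note that the maximum value for each $\beta_i$ (with $i \ne l$) is given by $\alpha_l \beta_i^* = \frac{1}{\nu m} - \alpha_i$
and, thus,

$$\alpha_l \sum_{i=1, i \ne l}^{n} \beta_i^* = \sum_{i=1, i \ne l}^{n} \Bigl(\frac{1}{\nu m} - \alpha_i\Bigr) = \sum_{i=1, i \ne l}^{n^*} \Bigl(\frac{1}{\nu m} - \alpha_i\Bigr) = \frac{n^* - 1}{\nu m} - 1 + \alpha_l \ge \alpha_l. \tag{12.54}$$

By rescaling each $\beta_i^*$ with a constant factor $0 < \mu \le 1$ we obtain suitable $\beta_i = \mu \beta_i^*$
which satisfy the conditions imposed in the definition of $S_{\mathrm{support}}(l)$. ∎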


Next we have to state an analog of Theorem 12.14. In this context we must define
more specifically what we consider an error, and, thus, a leave-one-out error for
the problem of quantile estimation.

Rather than using the threshold $\rho$ obtained by minimizing the adaptive regularized
risk functional (see Chapter 8) we should introduce an additional³ “margin”
$\Delta\rho$. A pattern is only classified as atypical if $f(x) < \rho - \Delta\rho$. Otherwise all SVs,
whether in-bound or not, would be classified as leave-one-out errors.


3. This additional margin $\Delta\rho$ is also needed in order to prove uniform convergence-type
results.
Theorem 12.17 (Misclassification Bound via the Span) If, in the leave-one-out
procedure, an in-bound Support Vector $x_l$ is recognized incorrectly, then (with the additional
margin $\Delta\rho$ as described above) the following inequality holds,

$$\alpha_l S_{\mathrm{support}}(l) \max\left(\frac{D}{\Delta\rho}, \frac{\nu m}{D}\right) \ge 1 \tag{12.55}$$

if $n^* - 1 \ge \nu m$. Otherwise

$$\sqrt{2}\, \mathrm{Swap}(l) \max\left(\frac{D}{\Delta\rho}, \frac{\nu m}{D}\right) \ge 1 \tag{12.56}$$

is applicable. Furthermore, if the sets of Support Vectors and of in-bound Support Vectors
remain the same during the leave-one-out procedure, then for any Support Vector $x_l$ the
following equality holds;

$$\rho^l - f^l(x_l) = \alpha_l S^2_{\mathrm{support}}(l). \tag{12.57}$$

Here $f^l$ is the minimizer of the quantile estimation problem in which $x_l$ has been removed.


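Proof As in [565], denote by $\alpha^0$ the solution obtained by minimizing the regularized
risk functional depending on $m$ samples, and by $\alpha^l$ the solution obtained
through leaving out sample $l$ (analogously denote by $\rho^l$ the margin obtained by
such an estimator). Further, denote the value of the dual objective function derived
from the regularized risk functional by

$$D(\alpha) := -\frac{1}{2} \alpha^\top K \alpha \text{ subject to } \sum_{i=1}^{m} \alpha_i = 1 \text{ and } 0 \le \alpha_i \le \frac{1}{\nu m}. \tag{12.58}$$

In this sign convention the solution maximizes $D$. The vector $\alpha^0 - \delta$ is feasible for the
leave-one-out problem whenever $\delta \in \mathbb{R}^m$ leaves the constraints of (12.58) intact and
satisfies $\delta_l = \alpha_l$, hence $D(\alpha^l) \ge D(\alpha^0 - \delta)$. Similarly, $D(\alpha^0) \ge D(\alpha^l + \gamma)$ for all
$\gamma \in \mathbb{R}^m$ such that $\alpha^l + \gamma$ satisfies the constraints of (12.58). Hence we obtain

$$D_1 := D(\alpha^0) - D(\alpha^0 - \delta) \ge D(\alpha^0) - D(\alpha^l) \ge D(\alpha^l + \gamma) - D(\alpha^l) =: D_2. \tag{12.59}$$

Next we have to compute or bound $D_1$ and $D_2$. For $n^* - 1 \ge \nu m$ we choose $\delta$ via
the minimizer $\beta$ of (12.49), setting $\delta_i = -\alpha_l \beta_i$. This gives,

$$D_1 = -\frac{1}{2} (\alpha^0)^\top K \alpha^0 + \frac{1}{2} (\alpha^0 - \delta)^\top K (\alpha^0 - \delta) \tag{12.60}$$
$$= \frac{1}{2} \delta^\top K \delta - \delta^\top K \alpha^0 \tag{12.61}$$
$$= \frac{1}{2} \alpha_l^2 \beta^\top K \beta - \delta^\top (K \alpha^0 - \rho) \tag{12.62}$$
$$= \frac{1}{2} \alpha_l^2 S^2_{\mathrm{support}}(l). \tag{12.63}$$

Here (12.62) follows from the choice of $\delta$ and the fact that $\sum_i \delta_i = 0$ (note also that
$\sum_i \delta_i \rho = 0$). Finally, (12.63) is due to the fact that $\delta_i \ne 0$ for in-bound SVs only, for
which $(K\alpha^0)_i = \rho$, and thus the second term in (12.62) vanishes.

For $n^* = \lceil \nu m \rceil$ we cannot find a suitable $\delta$ based solely on in-bound SVs and
therefore we must add an additional pattern, say with index $j$, which was not
previously a Support Vector. All we do is swap $\alpha_l$ and $\alpha_j$, giving $\delta_l = \alpha_l$ and
$\delta_j = -\alpha_l$. In particular, we pick $j$ to minimize (12.51). For $D_1$ we thus obtain

$$D_1 = \frac{1}{2} \delta^\top K \delta - \delta^\top K \alpha^0 \tag{12.64}$$
$$= \frac{\alpha_l^2}{2} (K_{ll} + K_{jj} - 2K_{lj}) + \delta^\top (\rho - K \alpha^0) \tag{12.65}$$
$$= \frac{\alpha_l^2}{2} (K_{ll} + K_{jj} - 2K_{lj}) + \alpha_l \Bigl(\sum_{i=1}^{m} \alpha_i K_{ij} - \rho\Bigr) = \mathrm{Swap}^2(l). \tag{12.66}$$

Next we have to compute $D_2$. We choose a particular value of $\gamma$, namely $\gamma_l =
a = -\gamma_j$ for some $j$ within the set of in-bound SVs obtained from leave-one-out
training, and set all other coefficients to $0$. Expanding $D_2$ leads to

$$D_2 = -\frac{1}{2} \gamma^\top K \gamma - \gamma^\top K \alpha^l = -\frac{1}{2} \gamma^\top K \gamma - \gamma^\top \bigl[K \alpha^l - \rho^l\bigr] \tag{12.67}$$
$$= -\frac{a^2}{2} (K_{ll} + K_{jj} - 2K_{lj}) - a \Bigl(\sum_{i=1}^{m} \alpha_i^l K_{il} - \rho^l\Bigr). \tag{12.68}$$

If $x_l$ generates a leave-one-out error we know that $\rho^l - \sum_{i=1}^{m} \alpha_i^l K_{il} > \Delta\rho$. In addition,
$(K_{ll} + K_{jj} - 2K_{lj})$ is the squared distance between $x_l$ and an SV obtained by leaving
$x_l$ out. This, of course, can be bounded by $D^2$, where $D$ is the diameter of the
data (in feature space). Therefore

$$D_2 \ge -\frac{a^2}{2} D^2 + a \Delta\rho \tag{12.69}$$

and, after unconstrained maximization over $a$, we obtain $D_2 \ge \frac{(\Delta\rho)^2}{2 D^2}$ for $a_{\min} = \frac{\Delta\rho}{D^2}$.
We have to take into account, however, that $a \le \frac{1}{\nu m}$ and thus, for $a_{\min} > \frac{1}{\nu m}$, we
obtain

$$D_2 \ge -\frac{D^2}{2 (\nu m)^2} + \frac{\Delta\rho}{\nu m} \ge -\frac{D^2}{2 (\nu m)^2} + \frac{D^2}{(\nu m)^2} = \frac{D^2}{2 (\nu m)^2}. \tag{12.70}$$

Here we exploit the assumption that $a_{\min} = \frac{\Delta\rho}{D^2} > \frac{1}{\nu m}$. Taking the minimum of the
two lower bounds leads to the following inequality for $D_2$;

$$D_2 \ge \frac{1}{2} \min\left(\frac{(\Delta\rho)^2}{D^2}, \frac{D^2}{(\nu m)^2}\right). \tag{12.71}$$

Finally, since $D_1 \ge D_2$, (12.71) in combination with (12.63) proves the bound.

To prove (12.57) note that, for the case where no additional point becomes a
Support Vector, by construction $\max_\delta D(\alpha^0 - \delta) = D(\alpha^l)$ and, furthermore,
$\max_\gamma D(\alpha^l + \gamma) = D(\alpha^0)$. Moreover, note that in this case $n^* - 1 \ge \nu m$ since,
otherwise, the $\nu$-property would not be satisfied for the leave-one-out estimate. We
next show that $\delta = -\alpha_l \lambda$ where $\lambda$ is the minimizer of $S^2_{\mathrm{support}}(l)$. To see this note
that, for any $\delta \in \mathbb{R}^n$ with $\sum_i \delta_i = 0$ and $\delta_l = \alpha_l$, it follows from (12.60) that

$$D_1 = \frac{1}{2} \delta^\top K \delta - \delta^\top K \alpha^0 = \frac{1}{2} \delta^\top K \delta - \delta^\top (K \alpha^0 - \rho) = \frac{1}{2} \delta^\top K \delta. \tag{12.72}$$

The latter is minimized ($D(\alpha^0 - \delta)$ is maximized) for $\delta = -\alpha_l \lambda$ and, thus, $D_1 =
\frac{1}{2} \alpha_l^2 S^2_{\mathrm{support}}(l)$. Finally, note that

$$D(\alpha^l + \gamma) - D(\alpha^l) = -\frac{1}{2} \gamma^\top K \gamma - \gamma^\top K \alpha^l \tag{12.73}$$
$$= -\frac{1}{2} \gamma^\top K \gamma - \gamma^\top (K \alpha^l - \rho^l) \tag{12.74}$$
$$= -\frac{1}{2} \alpha_l^2 S^2_{\mathrm{support}}(l) - \gamma^\top (K \alpha^l - \rho^l) \tag{12.75}$$
$$= -\frac{1}{2} \alpha_l^2 S^2_{\mathrm{support}}(l) - \alpha_l \bigl(f^l(x_l) - \rho^l\bigr) \tag{12.76}$$

where (12.74) follows from $\sum_i \gamma_i = 0$, (12.75) is a consequence of the optimality
of $-\alpha_l \lambda$ for $\gamma$, and in (12.76) all but one term in the dot product vanish since
$f^l(x_i) - \rho^l = 0$ for all in-bound Support Vectors. Exploiting the equality between
$D_1$ and the maximum value of $D(\alpha^l + \gamma) - D(\alpha^l)$ proves (12.57). ∎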


Even though the assumption that led to (12.57) is rarely satisfied in practice, it 
provides a good approximation of the leave-one-out error and therefore may be 
preferable to the more conservative estimate obtained from Theorem 12.17.


12.2.5 Methods from Statistical Physics 


Opper and Winther [396] use reasoning similar to that above to provide an
estimator of the leave-one-out error. As in (12.48) and (12.57), the
following assumptions are made:


■ In-bound Support Vectors $x_i$ will remain so even after removing a pattern $x_l$ and
the corresponding function values $f(x_i)$ will not change under the leave-one-out
procedure.


■ Margin errors will remain margin errors.


■ Correctly classified patterns with a margin will remain correctly classified.


These assumptions are typically not satisfied, yet in practice the leave-one-out
approximation that follows from them is still fairly accurate (in the experiments 
of [396] the accuracy was to within 1 + 10°). The practical appeal of the methods 
in this section is their computational simplicity when compared to the span bound 
and similar methods, which require the solution of a quadratic program.⁵

For the sake of simplicity we assume that the first n patterns are in-bound SVs, 
and that patterns $n + 1$ to $n^*$ are bound SVs. We begin with a simple example:


4. Other studies, such as the one by Dawson [131] found lower but still relatively high 
accuracies of the estimate at a very favorable computational cost of the method. 

5. In fact, the span bound almost attempts to solve the quadratic program resulting from 
leaving one sample out exactly, under the assumption that the SVs remain unchanged. 




SV classification without a constant threshold. Here

$$f(x) = \sum_{j=1}^{n^*} \alpha_j k(x_j, x) \tag{12.77}$$

$$f^l(x) = \sum_{j=1, j \ne l}^{n^*} \alpha_j^l k(x_j, x). \tag{12.78}$$

Using the notation $\delta\alpha_j := \alpha_j^l - \alpha_j$, we obtain

$$f^l(x) - f(x) = \sum_{j=1, j \ne l}^{n^*} \delta\alpha_j k(x_j, x) - \alpha_l k(x_l, x). \tag{12.79}$$

For an in-bound SV $x_i$ (where $i \ne l$) to remain so after the leave-one-out procedure
we need $f(x_i) = f^l(x_i)$ or, more specifically,

$$\alpha_l k(x_l, x_i) = \sum_{j=1, j \ne l}^{n^*} \delta\alpha_j k(x_j, x_i). \tag{12.80}$$


Since bound SVs are assumed to remain bound, their coefficients do not change, and
this leads to a system of $n$ variables with $n$ linear constraints if $x_l$ is a bound
constrained SV (and $n - 1$ variables and constraints if $x_l$ is an in-bound SV).
The obvious search is for a method to solve this linear system efficiently for all
$n^*$ SVs. We will show that, in a slightly more general setting of semiparametric
models (and/or additional adaptive margins), we may compute the leave-one-out
approximation with $O((n + N)^2 n^*)$ cost. This is considerably cheaper than the $n$
quadratic programs in $n$ variables that must be solved in the case of the span bound.
Additionally, the estimates may be even more precise than those when using the 
span bound; margin errors are not necessarily real errors, nor does the fact that 
a pattern is a margin error automatically imply that it will be misclassified by a 
classifier which ignored it during training. 
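
The linear system (12.80) can be sketched in a few lines. The code below is a minimal illustration, assuming a pure kernel expansion without threshold, a full kernel matrix `K` over all patterns as nested lists, and a nonsingular in-bound kernel submatrix; the helper names `solve` and `loo_change` are hypothetical.

```python
from typing import List, Sequence


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def loo_change(K: Sequence[Sequence[float]], alpha: Sequence[float],
               in_bound: Sequence[int], l: int) -> float:
    """Approximate f^l(x_l) - f(x_l) via (12.79), with delta-alpha solving (12.80)."""
    idx = [j for j in in_bound if j != l]            # in-bound SVs other than x_l
    A = [[K[i][j] for j in idx] for i in idx]
    b = [alpha[l] * K[l][i] for i in idx]
    d_alpha = solve(A, b) if idx else []
    return sum(d * K[j][l] for d, j in zip(d_alpha, idx)) - alpha[l] * K[l][l]


K = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
alpha = [0.3, 0.4, 2.0]
print(loo_change(K, alpha, in_bound=[0, 1], l=2))  # bound SV removed
print(loo_change(K, alpha, in_bound=[0, 1], l=0))  # in-bound SV removed
```

For the bound SV the result equals $-\alpha_l (K_{ll} - k^\top (K^n)^{-1} k)$ with $k$ the vector of kernel values between $x_l$ and the in-bound SVs, which is the pure-kernel special case of the expression derived below.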

Rather than deriving equations for the simple case of a pure kernel expansion,
we consider the situation that we have a semiparametric model, in particular a
kernel expansion plus a small, fixed number of terms (see Section 4.8), and we
derive a closed form expression for it. This includes the addition of a constant
offset $b$, as a special case. Without going into details (which can be found in
Chapter 4) we have the following kernel expansion

$$f(x) = \sum_{i=1}^{N} \beta_i \psi_i(x) + \sum_{i=1}^{m} \alpha_i k(x_i, x) \tag{12.81}$$

$$\text{subject to } \sum_{i=1}^{m} \alpha_i \psi_j(x_i) = 0 \text{ for all } j \in [N]. \tag{12.82}$$

Here $\psi_i$, with $i \in [N]$, are additional parametric functions (setting $\psi_1(x) = 1$ and
$N = 1$ would lead to the case of a constant offset, $b$). The following proposition gives
an approximation of the changes due to leave-one-out training. As before, only the
Support Vectors matter in the calculations. To give a more concise representation,
we state the equations in matrix notation.
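Proposition 12.18 (Mean Field Leave-One-Out Approximation) Denote by $K^n \in
\mathbb{R}^{n \times n}$ the submatrix of kernel functions between in-bound SVs and by $K^{n^*} \in \mathbb{R}^{(n^* - n) \times n}$
the submatrix of kernel functions between in-bound and bound SVs. Likewise, denote by
$\Psi^n \in \mathbb{R}^{n \times N}$ the matrix consisting of the function values $\psi_j(x_i)$, where $x_i$ are in-bound
SVs, and by $\Psi^{n^*} \in \mathbb{R}^{(n^* - n) \times N}$ the matrix of function values $\psi_j(x_i)$ where $x_i$ are bound
SVs. Then, if $x_l$ is an in-bound SV, the following approximation holds;

$$f(x_l) - f^l(x_l) \approx \alpha_l \left( \left[ \begin{bmatrix} K^n & \Psi^n \\ (\Psi^n)^\top & 0 \end{bmatrix}^{-1} \right]_{ll} \right)^{-1} \tag{12.83}$$

and, for bound constrained SVs,

$$f(x_l) - f^l(x_l) \approx \alpha_l \left[ K_{ll} - \left( \begin{bmatrix} K^{n^*} & \Psi^{n^*} \end{bmatrix} \begin{bmatrix} K^n & \Psi^n \\ (\Psi^n)^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} K^{n^*} & \Psi^{n^*} \end{bmatrix}^\top \right)_{ll} \right]. \tag{12.84}$$

Here we use $K_{ll}$ as shorthand for $k(x_l, x_l)$.


Proof We begin with the case that $x_l$ is a bound constrained Support Vector.
Since we have to enforce $f(x_i) = f^l(x_i)$ for all in-bound SVs, while maintaining
the constraints (12.82), we have a system of $n + N$ variables ($\delta\alpha_i$ and $\delta\beta_i$) together
with $n + N$ constraints. In matrix notation the above condition translates into

$$\begin{bmatrix} K^n & \Psi^n \\ (\Psi^n)^\top & 0 \end{bmatrix} \begin{bmatrix} \delta\alpha_1 \\ \vdots \\ \delta\alpha_n \\ \delta\beta_1 \\ \vdots \\ \delta\beta_N \end{bmatrix} = \alpha_l \begin{bmatrix} k(x_1, x_l) \\ \vdots \\ k(x_n, x_l) \\ \psi_1(x_l) \\ \vdots \\ \psi_N(x_l) \end{bmatrix}. \tag{12.85}$$

Solving (12.85) for $\delta\alpha_i$ and $\delta\beta_i$ and substituting into the approximation for $f^l(x)$
leads to

$$f(x_l) - f^l(x_l) \approx \alpha_l k(x_l, x_l) - \sum_{i=1}^{n} \delta\alpha_i k(x_i, x_l) - \sum_{i=1}^{N} \delta\beta_i \psi_i(x_l) \tag{12.86}$$

$$= \alpha_l K_{ll} - \alpha_l \begin{bmatrix} k(x_1, x_l) \\ \vdots \\ k(x_n, x_l) \\ \psi_1(x_l) \\ \vdots \\ \psi_N(x_l) \end{bmatrix}^\top \begin{bmatrix} K^n & \Psi^n \\ (\Psi^n)^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} k(x_1, x_l) \\ \vdots \\ k(x_n, x_l) \\ \psi_1(x_l) \\ \vdots \\ \psi_N(x_l) \end{bmatrix}. \tag{12.87}$$
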

Rewriting (12.87) leads to (12.84). To compute this expression efficiently we may
use an indefinite symmetric (e.g. triangular) factorization of the inverse matrix
into $T^\top D T$. Often it will be necessary to compute the pseudoinverse [131], since
$K^n$ tends to be rank degenerate for many practical kernels. Overall, the calculation
costs $O((n + N)^2 (n^* + N))$ operations.

In the case that $x_l$ is an in-bound Support Vector we obtain a similar expression,
the only difference being that the row and column corresponding to $x_l$ are
removed from (12.87) in both $K^n$ and $\Psi^n$. Recall that (see [337], 9.11.3.2a)

$$\begin{bmatrix} A & C \\ C^\top & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1} C (D - C^\top A^{-1} C)^{-1} C^\top A^{-1} & -A^{-1} C (D - C^\top A^{-1} C)^{-1} \\ -(D - C^\top A^{-1} C)^{-1} C^\top A^{-1} & (D - C^\top A^{-1} C)^{-1} \end{bmatrix}. \tag{12.88}$$

By setting $A$ equal to the square matrix in (12.87), with the contributions of $x_l$
removed, and, further, identifying $C$ with the remaining column vector (which
contains the contribution of $x_l$) we see that (12.87) can be rewritten as in (12.83).

∎
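
The in-bound case (12.83) can be illustrated with a short sketch. It assumes small dense matrices stored as nested lists and a nonsingular block matrix; the helper names `inverse` and `mean_field_in_bound` are hypothetical, and in practice one would use a factorization or pseudoinverse as discussed in the proof.

```python
from typing import List, Sequence


def inverse(A: Sequence[Sequence[float]]) -> List[List[float]]:
    """Gauss-Jordan inverse with partial pivoting (assumes A is nonsingular)."""
    n = len(A)
    M = [[float(v) for v in A[r]] + [1.0 if r == c else 0.0 for c in range(n)]
         for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def mean_field_in_bound(K_n: Sequence[Sequence[float]],
                        Psi_n: Sequence[Sequence[float]],
                        alpha: Sequence[float], l: int) -> float:
    """Approximate f(x_l) - f^l(x_l) for an in-bound SV x_l via (12.83)."""
    n = len(K_n)
    N = len(Psi_n[0]) if Psi_n and Psi_n[0] else 0
    top = [list(K_n[i]) + list(Psi_n[i]) for i in range(n)]             # [K^n  Psi^n]
    bottom = [[Psi_n[i][j] for i in range(n)] + [0.0] * N for j in range(N)]  # [Psi^n^T  0]
    return alpha[l] / inverse(top + bottom)[l][l]


# Pure kernel expansion (N = 0): agrees with the Schur complement 2 - 1/2 = 1.5.
print(mean_field_in_bound([[2.0, 1.0], [1.0, 2.0]], [[], []], [0.6, 0.2], 0))
# Constant offset (psi_1(x) = 1, N = 1).
print(mean_field_in_bound([[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]], [0.25, 0.75], 0))
```

The first call shows the content of (12.88): the reciprocal of the $ll$ entry of the inverse equals $K_{ll}$ minus the quadratic form in the remaining in-bound SVs, so a single inversion serves all in-bound patterns.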


Remark 12.19 (Modifications for Classification) In the case of classification, the
function expansions are usually given by sums of $y_i \alpha_i k(x_i, x)$. Simply replace $\alpha_i$ by $y_i \alpha_i$
throughout to apply Proposition 12.18.


Since the assumptions regarding the stability of the types of SVs that led to this 
result are the same as the ones that led to (12.48) and (12.57), it comes as no 
surprise that the trick (12.88) can also be applied to compute (12.44) under those 
assumptions. This is due to the fact that in this case, the box constraints in (12.45)
can be dropped. These issues are discussed in detail in [102]. 

In order to apply a similar reasoning to the $\nu$-SVM a slightly modified approach
is needed. We only state the result for classification. The proof and extensions to 
regression and novelty detection are left as an exercise (see Problem 12.4). 


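Proposition 12.20 (Mean Field Leave-One-Out for $\nu$-Classification) Let $K^n$ denote
the $n \times n$ submatrix of kernel functions between in-bound SVs, and $K^{n^*}$ the
$(n^* - n) \times n$ submatrix of kernel functions between in-bound and bound SVs. Moreover,
denote by $y^n \in \mathbb{R}^n$ the vector of labels ($\pm 1$) of the in-bound SVs, likewise by $y^{n^*} \in \mathbb{R}^{n^* - n}$
the vector of labels of bound SVs, and by $1^n \in \mathbb{R}^n$ and $1^{n^*} \in \mathbb{R}^{n^* - n}$ the corresponding
vectors with all entries set to $1$. Then, if $x_l$ is an in-bound SV, the following approximation
holds;

$$(y_l f(x_l) - \rho) - (y_l f^l(x_l) - \rho^l) \approx \alpha_l \left( \left[ \begin{bmatrix} K^n & y^n & 1^n \\ (y^n)^\top & 0 & 0 \\ (1^n)^\top & 0 & 0 \end{bmatrix}^{-1} \right]_{ll} \right)^{-1} \tag{12.89}$$

and, for bound SVs;

$$(y_l f(x_l) - \rho) - (y_l f^l(x_l) - \rho^l) \approx \alpha_l \left[ K_{ll} - \left( \begin{bmatrix} K^{n^*} & y^{n^*} & 1^{n^*} \end{bmatrix} \begin{bmatrix} K^n & y^n & 1^n \\ (y^n)^\top & 0 & 0 \\ (1^n)^\top & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} K^{n^*} & y^{n^*} & 1^{n^*} \end{bmatrix}^\top \right)_{ll} \right]. \tag{12.90}$$

Again, we use $K_{ll}$ as a shorthand for $k(x_l, x_l)$.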


Remark 12.21 (Absolute Differences in Classification) Often it will appear more
useful to compute $y_l (f(x_l) - f^l(x_l))$ only, rather than the relative distance from the
margin. In this case we have to compensate for the changes in $\rho^l$; we should compute only the
changes in $\delta\alpha_i$ and $\delta b$ (for the constant offset). One can see that, for boundary constrained
SVs,

$$\begin{bmatrix} \delta\alpha_1 \\ \vdots \\ \delta\alpha_n \\ \delta b \\ \delta\rho \end{bmatrix} = \begin{bmatrix} K^n & y^n & 1^n \\ (y^n)^\top & 0 & 0 \\ (1^n)^\top & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} K^{n^*} & y^{n^*} & 1^{n^*} \end{bmatrix}^\top \tag{12.91}$$

and, therefore, the correction term in $\delta\rho$ can be easily ignored in the expansion. We obtain

$$y_l \bigl(f(x_l) - f^l(x_l)\bigr) = y_l \Bigl[\sum_{i=1}^{n} \delta\alpha_i y_i k(x_i, x_l) + \delta b\Bigr]. \tag{12.92}$$

As far as in-bound SVs are concerned, this is not so easily achieved, since the matrix to
be inverted is different for each pattern $x_l$. Luckily it is obtained by removing one row and
one column from the full $(n + 2) \times (n + 2)$ system (see the proof of Proposition 12.18 and
(12.91)). This allows us to compute its inverse by performing the converse operation to a
rank-1 update. To do this we use (12.88) in the opposite direction. One can easily check
by substitution that for

$$\begin{bmatrix} A & C \\ C^\top & D \end{bmatrix}^{-1} = \begin{bmatrix} \Upsilon & \Gamma \\ \Gamma^\top & \Delta \end{bmatrix} \text{ we have } A^{-1} = \Upsilon - \Gamma \Delta^{-1} \Gamma^\top.$$

Computing $A^{-1}$ costs only $O(n^2)$ operations per in-bound Support Vector, which is
acceptable, in particular, when compared to the cost of inverting the $(n + 2) \times (n + 2)$
system itself. This means that we can perform prediction as cheaply as in the standard
SVM case.
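
The downdating step of Remark 12.21 can be sketched as follows. Given the inverse of the full system, the inverse of the system with row and column $l$ removed follows from $A^{-1} = \Upsilon - \Gamma \Delta^{-1} \Gamma^\top$ without a new inversion. The function name `downdate_inverse` is hypothetical, and the sketch assumes that the diagonal entry $\Delta$ of the inverse is nonzero.

```python
from typing import List, Sequence


def downdate_inverse(M_inv: Sequence[Sequence[float]], l: int) -> List[List[float]]:
    """Given M^{-1}, return the inverse of M with row and column l removed.

    Uses A^{-1} = Upsilon - Gamma Delta^{-1} Gamma^T, where Delta = M_inv[l][l] is
    the scalar block and Gamma the remaining entries of row/column l.
    Cost is quadratic in the matrix size.
    """
    keep = [i for i in range(len(M_inv)) if i != l]
    delta = M_inv[l][l]
    return [[M_inv[i][j] - M_inv[i][l] * M_inv[l][j] / delta for j in keep]
            for i in keep]


# M = [[2, 1], [1, 2]] has inverse [[2/3, -1/3], [-1/3, 2/3]];
# removing index 1 leaves A = [2], whose inverse is 0.5.
M_inv = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
print(downdate_inverse(M_inv, 1))
```

Since the removed row and column need not be the last ones, the function simply indexes around position `l`, which corresponds to applying (12.88) after a permutation.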


Extensions to the situation where we have loss functions other than the
$\varepsilon$-insensitive or the soft margin will require further investigation. It is not yet clear
what the equivalent of in-bound and bound SVs should be, since it is a very rare
case that the slope of the loss function $c(x, y, f(x))$ changes at only a small number
of locations (for example, only once in the soft margin case, or twice in the
$\varepsilon$-insensitive case, etc.). This is needed, however, for cheap computation of the
leave-one-out approximation.
12.3 PAC-Bayesian Bounds


This section requires some basic knowledge of ideas common to Bayesian estima- 
tion (see Chapter 16). In particular, we suggest that the reader be familiar with the 
content of Section 16.1 before going further in this section. The reasoning below 
focuses on the case of classification, primarily in the noise-free case. The first work 
in this context is by McAllester [353, 354] with further applications by Herbrich 
and coworkers [240, 213, 238]. 

The proof strategy works as follows; after definitions of quantities such as the 
Gibbs and the Bayes classifier, needed in the context, we prove a set of theorems, 
commonly referred to as PAC-Bayesian theorems. These relate the posterior proba- 
bility of sets of hypotheses to uniform convergence bounds. Finally, we show how 
large margins and large posterior probabilities are connected through the concept 
of version spaces. 


12.3.1 Gibbs and Bayes Classifiers 


In a departure from the concepts of the previous chapters we will extend our view
from single classifiers to distributions over classifiers. This means that, instead of
a deterministic prediction, say f(x) = y, we may obtain predictions y according
to some P(y|x). This additionally means that we have to extend the notion of
empirical or expected loss of a function f to the empirical or expected loss with
respect to a distribution.

In the following denote by $\mathcal{F}$ a set of classifiers, by P(f) a prior probability
distribution over mappings $f : \mathcal{X} \to \mathcal{Y}$, and by P(f|Z) a posterior probability
distribution (see Section 16.1), based on P(f) and the data Z. We now proceed with the
definitions of risk and loss with respect to P(f) and P(f|Z).


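Definition 12.22 (Risk with Respect to Distributions) Denote by P(f) a distribution
over $\mathcal{F}$. Then the risk functional, with respect to a distribution, is defined as

$$ R[P(f)] := \mathbf{E}_{f \sim P(f)}\bigl[R[f]\bigr] \tag{12.93} $$

and, in particular,

$$ R_{\mathrm{emp}}[P(f)] := \mathbf{E}_{f \sim P(f)}\bigl[R_{\mathrm{emp}}[f]\bigr] \tag{12.94} $$
$$ R_{\mathrm{reg}}[P(f)] := \mathbf{E}_{f \sim P(f)}\bigl[R_{\mathrm{emp}}[f] + \lambda\Omega[f]\bigr]. \tag{12.95} $$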


Taking a sampling point of view, by considering the classifiers f directly, we arrive 
at the Gibbs classifier. 


Definition 12.23 (Gibbs Classifier) The Gibbs classifier is defined by the following
random variable,

$$ f_{\mathrm{Gibbs}}(x) := f(x) \text{ where } f \sim P(f|Z). \tag{12.96} $$

In other words, $f_{\mathrm{Gibbs}}(x)$ is given by f(x) where f is drawn randomly from P(f|Z) for fixed
Z. Note that, by definition, $R[f_{\mathrm{Gibbs}}] = R[P(f|Z)]$.


Another way to construct a predictor from a distribution P(f|Z) is to predict
according to the majority values of f(x) where $f \sim P(f|Z)$. We obtain the Bayes
classifier:


Definition 12.24 (Bayes Classifier) Denote by P(f|Z) a distribution over mappings
$f : \mathcal{X} \to \mathcal{Y}$, where $x \in \mathcal{X}$ and $Z \in (\mathcal{X} \times \mathcal{Y})^m$. Then the Bayes optimal classifier is given by

$$ f_{\mathrm{Bayes}}(x) := \operatorname*{argmax}_{y} P(f(x) = y \mid Z). \tag{12.97} $$

In other words, the Bayes optimal estimator chooses the class y with the largest posterior
probability.


Note that for regression (12.97) will lead to an estimator based on the mode of the
posterior distribution p(f(x)|Z). While in many cases the mode and the mean of a
distribution may be reasonably close (this is one of the practical justifications of
maximum a posteriori estimates, see (16.22)), the mean need not necessarily be
anywhere close to the optimal estimate. For instance, for the exponential distribution
with density $e^{-\xi}$ on $[0, \infty)$, the mode of the distribution is 0; the mean, however, is 1.

Despite their different definitions, the Bayes classifier and the Gibbs classifier are 
not completely unrelated. In fact, the following lemma, due to Herbrich, holds: 


Lemma 12.25 (Gibbs-Bayes Lemma [238]) Denote by P(f|Z) a distribution over mappings
$f : \mathcal{X} \to \mathcal{Y}$, where $x \in \mathcal{X}$ and $Z \in (\mathcal{X} \times \mathcal{Y})^m$, and by R[f] the loss due to f under the
0-1 loss, namely the loss function $c(x, y, f(x)) = 1 - \delta_y(f(x))$. Further, denote by $|\mathcal{Y}|$
the cardinality of $\mathcal{Y}$. Then the following inequalities hold:

$$ R[f_{\mathrm{Bayes}}] \le |\mathcal{Y}|\, R[f_{\mathrm{Gibbs}}] = |\mathcal{Y}|\, R[P(f|Z)]. \tag{12.98} $$


Proof To prove (12.98) we must consider the set $\Delta_Z$, where the Bayes classifier
commits an error. It is given by

$$ \Delta_Z := \{(x, y) \mid (x, y) \in \mathcal{X} \times \mathcal{Y} \text{ and } c(x, y, f_{\mathrm{Bayes}}(x)) = 1\}. \tag{12.99} $$

Then, for any given distribution P(x, y), the error of the Bayes estimator is given by
$R[f_{\mathrm{Bayes}}] = P(\Delta_Z)$. On $\Delta_Z$, however, the conditional probability $P(y = f_{\mathrm{Bayes}}(x) \mid Z, x)$
is bounded from below by $\frac{1}{|\mathcal{Y}|}$ since $f_{\mathrm{Bayes}}(x)$ is chosen according to (12.97). This
means that the error of the Gibbs classifier $f_{\mathrm{Gibbs}}$ on $\Delta_Z$ is at least $\frac{1}{|\mathcal{Y}|} R[f_{\mathrm{Bayes}}]$, which
is a lower bound for $R[f_{\mathrm{Gibbs}}]$ on $\mathcal{X} \times \mathcal{Y}$. ■

Note that a converse statement for bounding $R[f_{\mathrm{Gibbs}}]$ in terms of $R[f_{\mathrm{Bayes}}]$ is not
true. In particular, one can find cases where $R[f_{\mathrm{Bayes}}] = 0$ and $R[f_{\mathrm{Gibbs}}] \ge \frac{1}{2} - \epsilon$ for
any $\epsilon > 0$ (see also Problem 12.5). The practical use of Lemma 12.25 is in the ability
to extend bounds on $R[f_{\mathrm{Gibbs}}]$ to bounds on $R[f_{\mathrm{Bayes}}]$.
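
The relation between the two classifiers can be checked on a toy problem. The sketch below assumes a finite set of hypotheses with posterior weights and a small test set with noise-free labels; the function names and the data are illustrative assumptions, not taken from the text.

```python
from collections import defaultdict

def gibbs_risk(preds, weights, labels):
    """Expected 0-1 loss of f ~ P(f|Z): posterior-weighted average of the individual risks."""
    n = len(labels)
    return sum(w * sum(p != y for p, y in zip(pred, labels)) / n
               for pred, w in zip(preds, weights))

def bayes_predict(preds, weights, j):
    """Label with the largest posterior mass at test point j, cf. (12.97)."""
    mass = defaultdict(float)
    for pred, w in zip(preds, weights):
        mass[pred[j]] += w
    return max(mass, key=mass.get)

def bayes_risk(preds, weights, labels):
    """0-1 loss of the posterior-weighted majority vote."""
    n = len(labels)
    return sum(bayes_predict(preds, weights, j) != y
               for j, y in enumerate(labels)) / n

# Toy data (assumed): three hypotheses evaluated on four test points, labels in {-1, +1}.
labels = [1, 1, -1, -1]
preds = [
    [1, 1, -1, -1],   # always correct
    [1, -1, -1, 1],   # wrong on points 1 and 3
    [-1, 1, 1, -1],   # wrong on points 0 and 2
]
weights = [0.4, 0.3, 0.3]  # posterior P(f|Z), sums to one

r_gibbs = gibbs_risk(preds, weights, labels)
r_bayes = bayes_risk(preds, weights, labels)
```

In this example the majority vote is always correct while a randomly drawn hypothesis errs with probability 0.3, so the Bayes risk is bounded by $|\mathcal{Y}|$ times the Gibbs risk, but the Gibbs risk cannot be bounded by any multiple of the Bayes risk.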
In many cases it is necessary to relate the behavior of $f_{\mathrm{Bayes}}$ or $f_{\mathrm{Gibbs}}$ to the
behavior of a single hypothesis f obtained by a (possibly) different learning algorithm.
The following characterization for sets of hypotheses (due to Herbrich [238]) is
useful.


Definition 12.26 (Bayes Admissible Hypotheses) Denote by $\mathcal{F}$ a space of hypotheses
and by P(f) a probability (measure) over $\mathcal{F}$. Then a subset $\mathcal{F}_B \subseteq \mathcal{F}$ is called Bayes
admissible with respect to P(f) and some function $f \in \mathcal{F}$ if, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$ c(x, y, f(x)) = c(x, y, f_{\mathrm{Bayes}}(x)). \tag{12.100} $$

Here c is the 0-1 loss function and $f_{\mathrm{Bayes}}$ is the Bayes estimator with respect to
$P(f \mid f \in \mathcal{F}_B)$, that is, with respect to f restricted to $\mathcal{F}_B$.


Simply put, the loss of the single hypothesis $f \in \mathcal{F}$ has to agree with the Bayes
estimator $f_{\mathrm{Bayes}}$ based on $P(f \mid f \in \mathcal{F}_B)$. This criterion is not always easily verified;
in the case of kernels, however, the following lemma holds.


Lemma 12.27 (Balls in RKHS are Bayes Admissible) Denote by P(f) the uniform
measure over a subset $\mathcal{F}_B \subset \mathcal{H}$ of the RKHS $\mathcal{H}$. Then, for any additive offset b, any ball
$B_r(f) \subseteq \mathcal{F}_B$ with radius r and center f is Bayes admissible with respect to f + b.


Proof Simply note that, for any cut through the ball, the center of the ball always
lies in the bigger of the two parts (the offset b determines the amount by which the
cutting hyperplane misses the center). Thus the estimator at the center of the ball
f + b will agree with $f_{\mathrm{Bayes}}$. ■


We use this lemma to connect the concept of maximum margin classifiers with the 
notion of Bayes estimators derived from a large posterior probability. 


12.3.2 PAC-Bayesian Bounds for Single Classifiers 


Our aim in this section is to bound the expected error of $f_{\mathrm{Bayes}}$ and $f_{\mathrm{Gibbs}}$ based
on the posterior distribution P(f|Z) of the hypotheses on which zero, or low,
training error is obtained. We begin with a simple binomial tail bound on a
single hypothesis which achieves zero classification error on a sample of size m.
Essentially it plays a role analogous to that of Hoeffding's bound (5.7) in the nonzero
loss classification and regression cases.


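Lemma 12.28 (Binomial Tail Bound) Denote by $\xi \sim P$ a random variable with values
{0, 1} and by $\xi_1, \ldots, \xi_m$ m instances of $\xi$, as drawn independently from P. Then, for

$$ R_{\mathrm{emp}} := \frac{1}{m} \sum_{i=1}^{m} \xi_i \quad \text{and} \quad R := \mathbf{E}[\xi] \tag{12.101} $$

the following inequality holds:

$$ P(R_{\mathrm{emp}} = 0 \text{ and } R > \epsilon) \le e^{-m\epsilon} \quad \text{for all } m \in \mathbb{N} \text{ and } \epsilon > 0. \tag{12.102} $$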
Proof We prove the lemma by computing a bound on the probability of $R_{\mathrm{emp}} = 0$
under the assumption that $R > \epsilon$. For any $R = p > \epsilon$,

$$ P(\xi_1 = \ldots = \xi_m = 0 \mid R = p) = (1 - p)^m \le (1 - \epsilon)^m \le \exp(-m\epsilon) \tag{12.103} $$

since, for all $\epsilon \in (0, 1)$, we have $(1 - \epsilon) \le e^{-\epsilon}$. ■


This bound implies that, for a single hypothesis whose expected error exceeds $\epsilon$,
the probability of achieving zero training error over a sample of size m
decays exponentially as m increases. In all realistic cases, however, we
have more than one hypothesis to choose from.
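
The exponential decay can be checked numerically. The following sketch compares the exact probability $(1-p)^m$ of observing no errors with the bound $e^{-m\epsilon}$, and computes the sample size at which the bound drops below a confidence level $\delta$; the helper names and the chosen values of p, $\epsilon$, and m are illustrative assumptions.

```python
import math

def zero_error_probability(p, m):
    """Probability that m iid Bernoulli(p) losses are all zero, i.e. (1 - p)^m."""
    return (1.0 - p) ** m

def tail_bound(eps, m):
    """Right-hand side exp(-m * eps) of (12.102)."""
    return math.exp(-m * eps)

def sample_size_for(eps, delta):
    """Smallest m for which exp(-m * eps) <= delta."""
    return math.ceil(math.log(1.0 / delta) / eps)

eps = 0.05
checks = [
    zero_error_probability(p, m) <= tail_bound(eps, m)
    for m in (10, 100, 1000)
    for p in (0.05, 0.1, 0.5)      # true risks with p >= eps
]
m_needed = sample_size_for(eps, delta=0.01)
```

With these values, 93 error-free examples suffice to rule out, at confidence 0.99, that a fixed hypothesis has expected error above 0.05.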

While, given a set of possible hypotheses f to choose from, we could assign 
equal weight to all of them, there is absolutely no need to do so. In fact, we could 
“bet” beforehand, that is, before any data arrives, on the hypotheses we think 
will obtain a low error. Qualitatively, our goal is to obtain better bounds in the 
case where our bet is lucky, at the expense of obtaining slightly worse bounds for 
unlucky guesses. This concept was formalized by Shawe-Taylor et al. [491] and is 
commonly referred to as the Luckiness Framework (see also [241] for an extension 
of the framework to algorithms rather than classes of functions). We give a simple 
version [238], which can be used to combine bounds for individual hypotheses. 


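Theorem 12.29 (Combining Hypotheses) Denote by P a probability measure on $\mathcal{X}$
and by $\psi_i(x, \delta) : \mathcal{X} \times \mathbb{R} \to \{\mathrm{TRUE}, \mathrm{FALSE}\}$ with $i \in \mathbb{N}$, parametrized logical formulas
for which

$$ P_x(\psi_i(x, \delta) = \mathrm{TRUE}) \ge 1 - \delta \quad \text{for all } 0 < \delta \le 1. \tag{12.104} $$

Then, for all $\delta_i > 0$ with $\sum_i \delta_i = \delta \le 1$, the following inequality holds:

$$ P_x\Bigl(\prod_i \psi_i(x, \delta_i) = \mathrm{TRUE}\Bigr) \ge 1 - \delta. \tag{12.105} $$

Here we used $\prod$ to denote a logical AND, and $\sum$ to denote a logical OR.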


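Proof To prove (12.105) we replace the lhs by its complementary event and then
use a simple union bound argument on the individual terms. We obtain

$$ P_x\Bigl(\prod_i \psi_i(x, \delta_i) = \mathrm{TRUE}\Bigr) = 1 - P_x\Bigl(\sum_i \psi_i(x, \delta_i) = \mathrm{FALSE}\Bigr) \tag{12.106} $$
$$ \ge 1 - \sum_i P_x(\psi_i(x, \delta_i) = \mathrm{FALSE}) \tag{12.107} $$
$$ \ge 1 - \sum_i \delta_i = 1 - \delta. \tag{12.108} $$

Here the last inequality follows from (12.104). ■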


Note that typically the logical formulas $\psi_i(x, \delta)$ will be expressions as to whether
a certain bound holds and, moreover, $P_x$ will be the probability measure over all
m-samples. Next we combine Lemma 12.28 and Theorem 12.29 to obtain McAllester's first
PAC-Bayesian theorem [353], namely bounds on the generalization performance
of classifiers with zero training error. Furthermore, by using Hoeffding's bound
(Theorem 5.1), we may obtain a counterpart for the case of $R_{\mathrm{emp}}[f] > 0$.


Theorem 12.30 (Single Classifiers with Zero Loss [353]) For any probability distribution
P(X, Y) from which the training sample Z is drawn in an iid fashion, for any (prior)
probability distribution P(f) on the space of hypotheses $\mathcal{F}$, for any $\delta \in (0, 1)$, and for any f
for which $R_{\mathrm{emp}}[f] = 0$ and which has nonzero P(f), the following bound on the expected
risk holds:

$$ P(R[f] \ge \epsilon) \le \delta \quad \text{for} \quad \epsilon = \frac{-\ln P(f) - \ln \delta}{m}. \tag{12.109} $$

This means that the bound becomes better if we manage to “guess” f as well as 
possible by a suitable prior distribution P(f). Additionally, it justifies the maxi- 
mum a posteriori estimation procedure (see also Section 16.2.1) since for identical 
likelihoods (zero error) the prior probability is the only quantity to distinguish be- 
tween different hypotheses f. We prove this by a combination of Lemma 12.28 and 
Theorem 12.29. 


Proof A sufficient condition for (12.109) is to show that the bound holds simultaneously
for all f (and not only, say, for the maximizer of P(f)). All we must do
is set $\delta_i = \delta P(f_i)$ and consider the logical formulas $\psi_i(Z, \delta)$ (here Z is the training
sample), which fail exactly on the event controlled by the binomial tail bound (12.103):

$$ \psi_i(Z, \delta) := \Bigl\{ R_{\mathrm{emp}}[f_i] \ne 0 \text{ or } R[f_i] \le -\frac{1}{m} \ln \delta \Bigr\}. \tag{12.110} $$

Since $\sum_i \delta_i = \delta \sum_i P(f_i) = \delta$ we satisfy all conditions of Theorem 12.29, which proves
(12.109). ■


It is straightforward to obtain an analogous version for nonzero loss. The only
difference is that we have to use Hoeffding's bound (Theorem 5.1) instead of the binomial
tail bound.


Corollary 12.31 (Single Classifiers with Nonzero Loss [353]) For any probability
distribution P(X, Y) from which the training sample Z is drawn in an iid fashion, for
any (prior) probability distribution P(f) on the space of hypotheses $\mathcal{F}$, for any $\delta \in (0, 1)$,
and for any f with nonzero P(f) the following bound on the expected risk holds:

$$ P(R[f] \ge \epsilon) \le \delta \quad \text{for} \quad \epsilon = R_{\mathrm{emp}}[f] + \sqrt{\frac{-\ln P(f) - \ln \delta}{2m}}. \tag{12.111} $$


Proof Using (5.7) define the logical formulas $\psi_i(Z, \delta)$ with

$$ \psi_i(Z, \delta) := \Bigl\{ R[f_i] > R_{\mathrm{emp}}[f_i] + \sqrt{\frac{-\ln \delta}{2m}} \Bigr\}. \tag{12.112} $$

By construction we have $P_Z(\psi_i(Z, \delta)) \le \delta$. Setting $\delta_i := \delta P(f_i)$ and applying
Theorem 12.29 to the complementary formulas we obtain (12.111). ■
Figure 12.1 Left: prior probability distribution P(f) (solid line) on $\mathcal{F}$ and restriction
$P(f|\mathcal{F}')$ to $\mathcal{F}'$ (dotted line). Right: P(f) (solid line) and reweighted posterior distribution
P(f|Z) (dotted line).


Comparing (12.111) with (12.109) we note the difference in the dependency on m. It
appears that (12.111) with $\sqrt{m}$ in the denominator is much less tight than (12.109),
where the bound depends on $\frac{1}{m}$.

This is, however, an artifact of Hoeffding’s bound in the sense that the factor 
of 2m in the denominator increases with decreasing R[f] and, thus, allows for a 
tighter bound (for R[f] —> 0 we recover the behavior of the binomial tail bound). 
Unfortunately, the latter is rather technical, which is why we omit a detailed 
description of the improvement. See [574], among others, for details on how to 
obtain tighter bounds. 
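
To get a feeling for the different rates, the two bounds can be evaluated side by side. The sketch below assumes that (12.109) reads $\epsilon = (-\ln P(f) - \ln\delta)/m$ and that (12.111) adds the deviation term $\sqrt{(-\ln P(f) - \ln\delta)/(2m)}$ to the empirical risk; the function names and the numerical values are illustrative assumptions.

```python
import math

def occam_bound_zero_loss(prior_mass, m, delta):
    """epsilon of (12.109) for a hypothesis with zero training error and prior mass P(f)."""
    return (-math.log(prior_mass) - math.log(delta)) / m

def occam_bound_nonzero_loss(emp_risk, prior_mass, m, delta):
    """epsilon of (12.111): empirical risk plus a Hoeffding-type deviation term."""
    return emp_risk + math.sqrt((-math.log(prior_mass) - math.log(delta)) / (2 * m))

# Assumed setting: uniform prior over 1000 hypotheses, confidence 0.95.
prior_mass, delta = 1e-3, 0.05
table = [
    (m,
     occam_bound_zero_loss(prior_mass, m, delta),
     occam_bound_nonzero_loss(0.0, prior_mass, m, delta))
    for m in (100, 1000, 10000)
]
```

For m = 1000 the zero-loss bound is roughly 0.01, whereas the Hoeffding-based bound evaluated at zero empirical risk is roughly 0.07, which illustrates the gap between the $\frac{1}{m}$ and $\frac{1}{\sqrt{m}}$ rates.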


12.3.3 PAC-Bayesian Bounds for Combinations of Classifiers 


More interesting than single classifiers, however, is the question of whether and
how bounds on combinations of such classifiers can be obtained. Two possible
strategies come to mind: we could combine hypotheses from a subset $\mathcal{F}' \subseteq \mathcal{F}$ of the space of
hypotheses and weigh them according to the prior distribution P(f), which we fix
before finding a good set of estimates [353]. Alternatively, we could use the posterior
distribution P(f|Z), influenced by the performance of estimates on the data, as a
weighting scheme [354]. See Figure 12.1 for an illustration of the two cases. The
following two theorems give uniform convergence bounds for these.


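Theorem 12.32 (Combination with Prior Weighting [353]) As before, denote by $\mathcal{F}$ a
hypothesis class and by P(f) a probability measure on $\mathcal{F}$, denote by $\mathcal{F}' \subseteq \mathcal{F}$ a measurable
subset, and by $P(f|\mathcal{F}')$ the probability distribution obtained from P(f) by restricting f
to $\mathcal{F}'$ and with normalization by $P(\mathcal{F}')$. Then the following bounds hold with probability
$1 - \delta$:

$$ R[P(f|\mathcal{F}')] \le R_{\mathrm{emp}}[P(f|\mathcal{F}')] + \sqrt{\frac{-\ln P(\mathcal{F}') - \ln \delta + 2 \ln m}{2m}} + \frac{1}{m} \tag{12.113} $$
$$ R[P(f|\mathcal{F}')] \le \frac{-\ln \delta - \ln P(\mathcal{F}') + 2 \ln m + 1}{m} \quad \text{if } R_{\mathrm{emp}}[P(f|\mathcal{F}')] = 0. \tag{12.114} $$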


We will give a proof of (12.114) at the end of this section, after we have stated the
“Quantifier Reversal Lemma” (Lemma 12.34), which is needed in the proof. Eq.
(12.113) can be proven analogously and its proof is therefore omitted. Before we
do so, however, we give the related result for a posterior distribution P(f|Z) rather
than only a restriction of P(f) onto $\mathcal{F}'$.


Theorem 12.33 (Combination with Posterior Weighting [354]) Again denote by $\mathcal{F}$
a hypothesis class and by P(f) and P(f|Z) two probability measures on $\mathcal{F}$, where P(f|Z)
depends on the data Z. Then the following bound holds with probability $1 - \delta$:

$$ R[P(f|Z)] \le R_{\mathrm{emp}}[P(f|Z)] + \sqrt{\frac{d(P(f|Z) \| P(f)) - \ln \delta + \ln m + 2}{2m - 1}}. \tag{12.115} $$

Here $d(P(f|Z) \| P(f))$ is the Kullback-Leibler divergence between P(f|Z) and P(f). It is
given by

$$ d(P(f|Z) \| P(f)) := \mathbf{E}_{f \sim P(f|Z)}\Bigl[\ln \frac{dP(f|Z)}{dP(f)}\Bigr]. \tag{12.116} $$


The proof of Theorem 12.33 is rather technical and we refer the reader to [354] for 
details. Note that (12.115) depends on the Kullback-Leibler divergence between 
the prior and posterior distributions. This is an (asymmetric) distance measure 
between the two distributions and vanishes only if they coincide. Consequently, 
the bound improves with our ability to guess (represented by the prior probability 
distribution P(f)) the likely outcome of the estimation procedure. 

On the other hand, it means that unless we make a lucky guess, the alternative 
being to remain cautious and choose a flat (= constant) prior P(f) over F, we 
will not obtain a bound that is much tighter than the logarithm of the number 
of significantly distinct functions in F. We obtain a bound similar to (5.36), the 
only benefit being the automatic adaptation of the scale, determined by the spread 
of P(f|Z), to the learning problem at hand. This “lucky guess” will allow us to 
take advantage of favorable data. However, we have to keep in mind that the
performance guarantees can also be significantly worse if we are “unlucky” in the
specification of the prior. The remaining terms such as $\ln \delta$ or the $m^{-1/2}$ dependency
are standard.

Finally, note that for a restriction of $\mathcal{F}$ onto a subset $\mathcal{F}'$ the Kullback-Leibler
divergence between $P(f|\mathcal{F}')$ and P(f) becomes $-\ln P(\mathcal{F}')$ since in this case
$dP(f|\mathcal{F}') = P(\mathcal{F}')^{-1} dP(f)$ if $f \in \mathcal{F}'$. This means that, up to constant terms, (12.113) is a special case
of (12.115).
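
For a finite hypothesis class the Kullback-Leibler term is easy to evaluate. The sketch below checks that restricting a discrete prior to a subset yields a divergence of $-\ln P(\mathcal{F}')$ and evaluates a posterior-weighted bound, assuming the deviation term of (12.115) has the form $\sqrt{(d - \ln\delta + \ln m + 2)/(2m - 1)}$; the function names and the example prior are illustrative assumptions.

```python
import math

def kl_divergence(posterior, prior):
    """d(Q||P) = sum_f Q(f) ln(Q(f)/P(f)) for discrete distributions; terms with Q(f) = 0 vanish."""
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)

def restrict(prior, subset):
    """Restriction of the prior to an index set, renormalized by the mass of that set."""
    mass = sum(prior[i] for i in subset)
    restricted = [prior[i] / mass if i in subset else 0.0 for i in range(len(prior))]
    return restricted, mass

def pac_bayes_bound(emp_risk, kl, m, delta):
    """Empirical risk plus the assumed deviation term of (12.115)."""
    return emp_risk + math.sqrt((kl - math.log(delta) + math.log(m) + 2) / (2 * m - 1))

prior = [0.1, 0.2, 0.3, 0.4]               # assumed prior over four hypotheses
posterior, mass = restrict(prior, {1, 2})  # keep hypotheses 1 and 2 only
kl = kl_divergence(posterior, prior)       # equals -ln(mass) = ln 2
bound = pac_bayes_bound(0.1, kl, m=1000, delta=0.05)
```

A posterior that concentrates on a set of small prior mass pays for it through a larger divergence, which is the quantitative form of an "unlucky" prior.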

As promised we conclude this section with a proof of (12.114). For this purpose 
we need a key lemma; the so-called “Quantifier Reversal Lemma”. 
Lemma 12.34 (Quantifier Reversal [353]) Denote by x, y random variables and let $\delta \in (0, 1]$.
Furthermore, let $\psi(x, y, \delta)$ be a measurable formula such that, for any x, y, we have
$\{\delta \in (0, 1] : \psi(x, y, \delta) = \mathrm{TRUE}\} = (0, \delta_{\max}]$ for some $\delta_{\max}$. If, for all x, and $\delta > 0$,

$$ P_y\{\psi(x, y, \delta) = \mathrm{TRUE}\} \ge 1 - \delta \tag{12.117} $$

then, for all $\delta > 0$, and $\beta \in (0, 1)$,

$$ P_y\Bigl\{\text{for all } \alpha > 0: P_x\bigl\{\psi\bigl(x, y, (\alpha\beta\delta)^{\frac{1}{1-\beta}}\bigr) = \mathrm{TRUE}\bigr\} \ge 1 - \alpha\Bigr\} \ge 1 - \delta. \tag{12.118} $$


This means that, if for all x and $\delta$ a logical formula holds with high probability
over y, then, with high probability over y, the formula also holds with high probability
over x, at a correspondingly adjusted confidence level. We transferred the uncertainty,
initially encoded by $\delta$, into one jointly
encoded by $\delta$ and $\alpha$. Now we are ready to prove (12.114). For the purpose of the
proof below, f will play the role of x and Z the role of y.


Proof of (12.114). We will use the binomial tail bound of Lemma 12.28. By analogy
to (12.110) we define

$$ \psi(f, Z, \delta) := \Bigl\{ R_{\mathrm{emp}}[f] = 0 \text{ implies } R[f] \le -\frac{\ln \delta}{m} \Bigr\}. \tag{12.119} $$

By construction (and via Lemma 12.28) we have that, for all f, and all $\delta > 0$,
$P_Z\{\psi(f, Z, \delta) = \mathrm{TRUE}\} \ge 1 - \delta$. This expression has the same form as the one
needed in (12.117). Therefore, for all $\delta > 0$, and $\beta \in (0, 1)$,

$$ P_Z\Bigl\{\text{for all } \alpha > 0: P_f\bigl\{\psi\bigl(f, Z, (\alpha\beta\delta)^{\frac{1}{1-\beta}}\bigr) = \mathrm{TRUE}\bigr\} \ge 1 - \alpha\Bigr\} \ge 1 - \delta. \tag{12.120} $$

The goal is to contract the two nested probability statements on Z and f into
one on Z alone. This is done by replacing the inner (probabilistic) statement by a
(deterministic) superset. Consider the argument of the first probability statement.
Here, with probability $1 - \delta$,

$$ P_f\bigl\{\psi\bigl(f, Z, (\alpha\beta\delta)^{\frac{1}{1-\beta}}\bigr) = \mathrm{TRUE}\bigr\} \ge 1 - \alpha. \tag{12.121} $$


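Now we substitute values for $\alpha$ and $\beta$. Denote by $\mathcal{F}' \subseteq \mathcal{F}$ a set for which
$R_{\mathrm{emp}}[P(f|\mathcal{F}')] = 0$ and, moreover, let $\alpha = \frac{P(\mathcal{F}')}{m}$ and $\beta = \frac{1}{m}$. We obtain

$$ P_f\Bigl\{ R_{\mathrm{emp}}[f] = 0 \text{ implies } R[f] \le \frac{-\ln P(\mathcal{F}') - \ln \delta + 2 \ln m}{m - 1} \Bigr\} \ge 1 - \frac{P(\mathcal{F}')}{m}. \tag{12.122} $$

This means that on $\mathcal{F}'$ the bound holds at least with probability $1 - \frac{1}{m}$ (recall
that $\mathcal{F}' \subseteq \{f \mid R_{\mathrm{emp}}[f] = 0\}$). In all other cases, the loss is bounded by 1 (we are
dealing with a classification problem). Averaging over all $f \in \mathcal{F}'$ adds a $\frac{1}{m}$ term to
the bound on R[f] in (12.122). This replaces the probabilistic statement over
f by a deterministic bound and we obtain that with probability $1 - \delta$,

$$ R_{\mathrm{emp}}[P(f|\mathcal{F}')] = 0 \text{ implies } R[P(f|\mathcal{F}')] \le \frac{-\ln P(\mathcal{F}') - \ln \delta + 2 \ln m + 1}{m}. \tag{12.123} $$

The proof of (12.113) is similar and is left as an exercise (see Problem 12.7). ■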
12.3.4 Applications to Large Margin Classifiers


We conclude this section with applications of the PAC-Bayesian theorems to the 
domain of large margin classifiers. The central idea is that a classifier achieving a 
large margin (or, equivalently, a small value of $\|w\|$) corresponds to the summary
of a large set of classifiers, which all have desirable generalization performance 
properties. This is so since, for a large margin classifier, there exist many similar 
classifiers which (albeit with a smaller margin) achieve the same (or similar) 
training error. 

To keep matters simple we show the basic idea using classifiers that achieve zero 
training error and where, further, all points lie on the surface of a hypersphere. 
The latter is a technical restriction that renders the computation of the volume of 
version spaces, the space F’ of hypotheses with Remp[f] = 0, much easier. We begin 
with the definition of a normalized margin. 


Definition 12.35 (Normalized Margin) Denote by $f(x) := \langle w, \Phi(x) \rangle$ a classifier with
$\|w\| = 1$. Here the normalized margin $\rho_{\mathrm{norm}}(f)$ is defined as

$$ \rho_{\mathrm{norm}}(f) := \min_{i} \frac{y_i f(x_i)}{\|\Phi(x_i)\|}. \tag{12.124} $$


In other words, this margin is normalized with respect to the length of the feature 
vectors of the training sample. The following bound on the generalization perfor- 
mance holds. 


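Theorem 12.36 (PAC-Bayesian Margin Bound [238]) For a feature space with dimension
$n \in \mathbb{N} \cup \{\infty\}$ and a linear classifier $\langle w, \Phi(x) \rangle$ achieving zero empirical risk
$R_{\mathrm{emp}}[f] = 0$ the following bound holds with probability $1 - \delta$:

$$ R[f] \le \frac{2}{m} \Bigl( -d \ln\Bigl(1 - \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}\Bigr) + 2 \ln m - \ln \delta + 2 \Bigr). \tag{12.125} $$

Here d := min(m, n).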
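
A small numerical sketch may help to see how the bound behaves. It computes the normalized margin for explicitly given feature vectors and evaluates the right-hand side of (12.125), assuming the form $\frac{2}{m}\bigl(-d\ln(1 - \sqrt{1 - \rho_{\mathrm{norm}}^2}) + 2\ln m - \ln\delta + 2\bigr)$; the function names and the toy data are illustrative assumptions.

```python
import math

def normalized_margin(w, features, labels):
    """min_i y_i <w, Phi(x_i)> / ||Phi(x_i)|| after rescaling w to unit length, cf. (12.124)."""
    norm_w = math.sqrt(sum(c * c for c in w))
    w = [c / norm_w for c in w]                      # enforce ||w|| = 1
    margins = []
    for phi, y in zip(features, labels):
        dot = sum(a * b for a, b in zip(w, phi))
        margins.append(y * dot / math.sqrt(sum(b * b for b in phi)))
    return min(margins)

def pac_bayes_margin_bound(rho, m, n, delta):
    """Assumed right-hand side of (12.125); requires 0 < rho <= 1, i.e. zero training error."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("the bound needs a positive normalized margin")
    d = min(m, n)
    return (2.0 / m) * (-d * math.log(1.0 - math.sqrt(1.0 - rho ** 2))
                        + 2.0 * math.log(m) - math.log(delta) + 2.0)

# Toy data (assumed): three feature vectors in two dimensions, all classified correctly.
features = [(2.0, 0.0), (0.0, 3.0), (-1.0, -1.0)]
labels = [1, 1, -1]
rho = normalized_margin([1.0, 1.0], features, labels)

wide = pac_bayes_margin_bound(0.9, m=1000, n=50, delta=0.05)
narrow = pac_bayes_margin_bound(0.5, m=1000, n=50, delta=0.05)
```

Because the margin is divided by the length of each feature vector, rescaling the data leaves `rho` unchanged, and a larger normalized margin gives a smaller value of the bound.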


Proof In a first step we must translate the size of $\rho_{\mathrm{norm}}(f)$ into a corresponding
size of the version space. For this purpose we give a lower bound on the maximum
angle by which any other w' may deviate from w such that the corresponding
$f'(x) := \langle w', \Phi(x) \rangle$ will still achieve $R_{\mathrm{emp}}[f'] = 0$.

The latter is equivalent to requiring that the angle $\angle(w', \Phi(x_i))$ between w' and
$\Phi(x_i)$ must not exceed $\frac{\pi}{2}$. By the triangle inequality we have

$$ |\angle(w', \Phi(x_i))| \le |\angle(w', w)| + |\angle(w, \Phi(x_i))| \le |\angle(w', w)| + \max_i |\angle(w, \Phi(x_i))|. $$

A sufficient condition for the angle constraint to hold is

$$ |\angle(w, w')| \le \frac{\pi}{2} - \max_i |\angle(w, \Phi(x_i))|. \tag{12.126} $$
Figure 12.2 A geometric interpretation of $\langle w, w' \rangle \ge \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}$.


Taking the cosine on both sides (this is admissible since all angles involved are
smaller than $\frac{\pi}{2}$) yields

$$ \langle w, w' \rangle \ge \sqrt{1 - \min_i \Bigl(\frac{\langle w, \Phi(x_i) \rangle}{\|\Phi(x_i)\|}\Bigr)^2} = \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}. \tag{12.127} $$

Figure 12.2 depicts this situation graphically.

We assume a uniform prior distribution over all functions on the unit sphere,
thus P(f) = const. In addition, the previous results mean that the set
$\mathcal{F}' := \{w' \mid \langle w, w' \rangle \ge \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}\}$ is consistent with the observations Z, hence
$R_{\mathrm{emp}}[P(f|\mathcal{F}')] = 0$. Therefore, $P(\mathcal{F}')$ is given by the volume ratio between $\mathcal{F}'$ (the
cap of the cone spanned between w and w') and the unit sphere in N dimensions.
After tedious calculations involving spherical harmonics and a binomial
tail bound (see [238] and [209, 373]) we find that

$$ \ln P(\mathcal{F}') = \ln \frac{\mathrm{Vol}(\mathcal{F}')}{\mathrm{Vol}(\mathcal{F})} \ge d \ln\Bigl(1 - \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}\Bigr). \tag{12.128} $$


This means that, for f drawn from $P(f|\mathcal{F}')$, we may apply (12.114) of
Theorem 12.32. Therefore, with probability $1 - \delta$, the following bound holds:

$$ R[P(f|\mathcal{F}')] \le m^{-1} \Bigl( -\ln \delta - d \ln\Bigl(1 - \sqrt{1 - \rho_{\mathrm{norm}}^2(f)}\Bigr) + 2 \ln m + 1 \Bigr). \tag{12.129} $$

The final step is to translate the statement about a distribution of functions into one
about a single classifier, namely the one given by w. This is achieved by appealing
to the Gibbs-Bayes Lemma (Lemma 12.25), which allows the conversion from a
Gibbs classifier to a Bayes-optimal classifier, and Lemma 12.27, which shows that,
in fact, w is the Bayes-optimal classifier corresponding to $\mathcal{F}'$.

The net effect is that the rhs of (12.129) increases by a factor of 2 (we have y = 1
and y = −1 as possible outcomes of the classification).⁶ ■




6. See also [314] for an improved version which does not depend on the cardinality of Y. 
12.4 Operator-Theoretic Methods in Learning Theory


In the previous section we considered distributions over the space of hypotheses 
F which correspond to a weighting scheme over all possible functions. One could, 
however, also use an explicit covering of F with a set of “representative” functions 
in order to approximate a (possibly) infinite number of hypotheses by a finite 
number N. This is advantageous since there exist many uniform convergence 
bounds for finite hypothesis classes. Further, there exist tools from the theory 
of Banach spaces which allow us to bound the number of functions needed to 
approximate $\mathcal{F}$ with sufficiently high precision.

The concepts presented in this section build largely on [606, 605, 518]. It is 
impossible to convey the results in complete detail and to point out further ex- 
tensions, many of which are due to Mendelson [358, 357], Kolchinskii [303], and 
coworkers. For another application of entropy numbers in mathematical learning 
theory, see [129]. 

We limit ourselves to the basic ideas underlying scale sensitive capacity mea- 
sures such as covering and entropy numbers, and the fat shattering VC dimension 
(Section 12.4.1). In this context we show how these capacity measures can be used 
to formulate bounds on the generalization performance of estimators and present 
analogues of Theorems 12.32 and 12.33, based on the number of functions needed 
to approximate F. The rest of the section is then devoted to methods for efficiently 
computing such capacity measures. Examples of translation invariant kernels (cf. 
Section 4.4) conclude the presentation. 

The material below builds on the mathematical prerequisites of functional
analysis and entropy numbers summarized in Section B.3.1.


12.4.1 Scale-Sensitivity and the Fat Shattering Dimension 


In Section 5.5.6 we introduced the notion of the VC dimension of a class of indi- 
cator functions F as the maximum number of points which F is able to shatter in 
any arbitrary way. Note that this notion was scale insensitive — changes of the 
sign of f were considered relevant regardless of the amount of change in f. This is 
not always the best way of measuring the capacity of functions, in particular when 
considering regression problems or large margin classifiers, where the scale of the 
solution matters. The following remark sheds some light on this problem: 


Remark 12.37 (Gaussian RBF Networks with Infinite VC Dimension) Denote by
r an arbitrary positive number and $\mathcal{X} \subset \mathbb{R}^N$ a compact set. Consider the class of functions

$$ \mathcal{F} := \Bigl\{ f \Bigm| f = \sum_i \alpha_i k(x_i, \cdot) \text{ with } x_i \in \mathcal{X},\ \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \le r \Bigr\}, \tag{12.130} $$

where k is the Gaussian kernel. We show that $\mathcal{F}$ has infinite VC dimension by demonstrating
that any arbitrary set $X = \{x_1, \ldots, x_m\} \subset \mathcal{X}$ of size m can be shattered by thresholded
functions from $\mathcal{F}$. According to Theorem 2.18, the matrix $(k(x_i, x_j))_{ij}$ has full rank.
Hence, for arbitrary $\{y_1, \ldots, y_m\} \in \{-1, 1\}^m$, there exists a function $\tilde f(\cdot) = \sum_i \tilde\alpha_i k(x_i, \cdot)$
with $\tilde f(x_j) = y_j$ for all j. Rescaling $\tilde f$ to satisfy the inequality in (12.130) yields an $f \in \mathcal{F}$
which still shatters the set, proving the statement.
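
The construction in the remark can be replayed numerically. The sketch below interpolates every labelling of a small point set with a Gaussian kernel expansion, rescales the coefficients until the constraint in (12.130) holds, and checks that the signs are preserved; the kernel width, the tiny linear solver, and all names are illustrative assumptions.

```python
import itertools
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def shatters(points, r):
    """True if every labelling is realized by an expansion with alpha^T K alpha <= r."""
    K = [[gaussian_kernel(p, q) for q in points] for p in points]
    m = len(points)
    for y in itertools.product((-1.0, 1.0), repeat=m):
        alpha = solve(K, list(y))                          # interpolant: K alpha = y
        norm_sq = sum(a * yi for a, yi in zip(alpha, y))   # alpha^T K alpha = alpha^T y
        scale = math.sqrt(r / norm_sq) if norm_sq > r else 1.0
        alpha = [scale * a for a in alpha]                 # rescale into the class (12.130)
        f = [sum(a * K[i][j] for i, a in enumerate(alpha)) for j in range(m)]
        if any(fj * yj <= 0 for fj, yj in zip(f, y)):
            return False
    return True

result = shatters([(0.0,), (1.0,), (2.5,), (4.0,)], r=1e-3)
```

Even with the very small radius r = 0.001 all 16 labellings of the four points are realized, since shrinking the function values does not change their signs.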


The term $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ in (12.130) equals the $\|w\|^2$ regularizer in $\mathcal{H}$, as used
in SVMs (see (2.49)). The VC dimension is thus infinite even if $\|w\|$ is small,
and it thus cannot directly be used to justify the large margin regularizer in SV
regression. Things are slightly different in pattern recognition. There, our final
scaling operation (obtaining f from $\tilde f$) would leave us with a hyperplane which is
no longer in canonical form with respect to $(x_1, y_1), \ldots, (x_m, y_m)$, cf. Definition 7.1.
Nevertheless, there are some problems in using the VC dimension also in that case,
see [491].⁷

The construction described in Remark 12.37 was possible since we were allowed 
to rescale f without sacrificing any of its discriminatory power. With a large margin 
classifier, on the other hand, we seek to find a solution which is the least scale 
sensitive possible. 

A first step towards better bounds is to introduce a scale sensitive counterpart 
of the VC dimension, dubbed the (level) fat shattering VC dimension. It was 
introduced to statistical learning theory by [286]. According to [537], however, the 
idea of fat shattering itself had been proposed by Kolmogorov already in the late 
1950s in the context of approximation theory. 

The fat shattering dimension of a function class $\mathcal{F}$ is a straightforward extension
of the VC dimension. It is defined for real valued functions as the maximum
number of points that can be $\gamma$-shattered. Here a set $\{x_1, \ldots, x_m\}$ is $\gamma$-shattered
if there exist some $b_i \in \mathbb{R}$ such that for all sign assignments $y_i \in \{\pm 1\}$ there is an $f \in \mathcal{F}$ with
$y_i(f(x_i) - b_i) \ge \gamma$. A slightly more restrictive definition is the level fat shattering
dimension, where a set is $\gamma$-shattered if $y_i(f(x_i) - b) \ge \gamma$ for one common value of
b. For applications to classification see [491, 460]. [22, 6] discuss the estimation of
real valued functions.
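
For a finite function class the definition can be checked by brute force. The sketch below takes the function values on the points and a given choice of offsets $b_i$ and tests every sign pattern; the function name and the toy classes are illustrative assumptions, and the search over the offsets themselves is not implemented.

```python
import itertools

def is_gamma_shattered(values, gamma, offsets):
    """values[k][i] = f_k(x_i) for a finite function class; offsets[i] = b_i.

    Returns True if for every sign pattern y in {-1, +1}^m some f_k satisfies
    y_i * (f_k(x_i) - b_i) >= gamma for all i.
    """
    m = len(offsets)
    for y in itertools.product((-1, 1), repeat=m):
        realized = any(
            all(yi * (fi - bi) >= gamma for yi, fi, bi in zip(y, f, offsets))
            for f in values
        )
        if not realized:
            return False
    return True

# Assumed toy class: four functions taking values +-1 on two points, and a rescaled copy.
values = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
small = [(0.1 * a, 0.1 * b) for a, b in values]
offsets = (0.0, 0.0)   # a common offset b = 0, i.e. the level fat shattering setting
```

The rescaled class still shatters the two points in the VC sense, but it is no longer $\gamma$-shattering them for $\gamma = 1$, which is exactly the scale sensitivity that the VC dimension lacks.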


12.4.2 Entropy and Covering Numbers 


Despite its improvement over the original definition, the fat shattering dimension
is still a fairly crude summary of the capacity of the class of functions under
consideration. Covering and entropy numbers can be used to derive more finely
grained capacity measures. We begin with some definitions.


7. Note that in Theorem 5.5, the hyperplane decision functions are only defined on a 
finite set of points. When defined on the whole space, the notion of canonicality cannot 
be employed anymore. Canonicality, however, is the notion that introduces scale sensitivity 
into the VC dimension analysis of margin classifiers. To work around this problem, we 
either have to use the notion of fat shattering dimension described below, or we have to 
define decision functions taking values in {+1,0}, with the value 0 referring to the margin 
[85, 564]. 
Recall that an $\epsilon$-cover of a set M in E is a set of points in E such that the union
of all $\epsilon$-balls around these points contains M.

The $\epsilon$-covering number of $\mathcal{F}$ with respect to the metric d, denoted $\mathcal{N}(\epsilon, \mathcal{F}, d)$, is the
smallest number of elements of an $\epsilon$-cover for $\mathcal{F}$ using the metric d. Typically, $\mathcal{F}$
will be the class of functions under consideration. Moreover, d will be the metric
induced by the values of $f \in \mathcal{F}$ on some data $X = \{x_1, \ldots, x_m\}$, such as the $\ell_\infty$
metric. We denote this metric by $\ell_\infty^m(X)$. For $\epsilon = 1$ we recover the (scale less)
definition of the covering number (Section 5.5.3).

To avoid some of the technical difficulties that come with this dependency on X,
one usually takes the supremum of $\mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^m(X))$ with respect to X. This quantity
will be called the $\epsilon$-growth function of the function class $\mathcal{F}$. Formally we have

$$ \mathcal{N}^m(\epsilon, \mathcal{F}) := \sup_{x_1, \ldots, x_m \in \mathcal{X}} \mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^m(X)), \tag{12.131} $$

where $\mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^m(X))$ is the $\epsilon$-covering number of $\mathcal{F}$ with respect to $\ell_\infty^m(X)$. Most
generalization error bounds can be expressed in terms of $\mathcal{N}^m(\epsilon, \mathcal{F})$. An example
(Lemma 12.38) is given in the following section.
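
For a finite collection of functions evaluated on a sample, an upper bound on the covering number can be obtained greedily. The sketch below represents each function by its vector of values on the data and measures distances in the maximum norm; the function names and the toy class of linear functions are illustrative assumptions, and the greedy cover need not be of minimal size.

```python
def linf_distance(u, v):
    """Maximum absolute difference of two value vectors (f(x_1), ..., f(x_m))."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_cover(vectors, eps):
    """Greedy eps-cover: a vector becomes a new center if no existing center is within eps.

    Every input vector ends up within eps of some center, so the number of centers
    upper-bounds the eps-covering number with centers restricted to the set itself.
    """
    centers = []
    for v in vectors:
        if all(linf_distance(v, c) > eps for c in centers):
            centers.append(v)
    return centers

# Assumed toy class: f_t(x) = t * x for t = 0, 0.1, ..., 1, evaluated on three points.
xs = [0.0, 0.5, 1.0]
vectors = [tuple(k / 10 * x for x in xs) for k in range(11)]
coarse = greedy_cover(vectors, 0.25)
fine = greedy_cover(vectors, 0.05)
```

At scale 0.25 four representatives are enough for the eleven functions, while at scale 0.05 every function needs its own center, which shows how the covering number grows as $\epsilon$ shrinks.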

Covering numbers and the growth function are inherently discrete quantities.
The functional inverse of $\mathcal{N}^m(\epsilon, \mathcal{F})$, referred to as the entropy number, however, is
more amenable to our analysis. The entropy number of a set $M \subset E$, for $n \in \mathbb{N}$, is
given by

$$ \epsilon_n(M) := \inf\{\epsilon > 0 \mid \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points}\}. \tag{12.132} $$

Since we are dealing with linear function classes, we will represent the possible
function values that these linear function classes can assume on the data as images
of linear operators. For this purpose we need to introduce the notion of entropy
numbers of operators. Denote by E, G Banach spaces and by $\mathcal{L}(E, G)$ the space of linear operators
from E into G. The entropy numbers of an operator $T \in \mathcal{L}(E, G)$ are defined as

$$ \epsilon_n(T) := \epsilon_n(T(U_E)), \tag{12.133} $$

where $U_E$ denotes the unit ball of E. Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ is well-defined for all $n \in \mathbb{N}$ precisely if T is
bounded (see Section B.3.1). Moreover, $\lim_{n \to \infty} \epsilon_n(T) = 0$ if and only if T is compact;
that is, if $T(U_E)$ is precompact.

A set is called precompact if its closure is compact. A set S is called compact if every
sequence in S has a subsequence that converges to an element also contained in
S.⁸

8. Strictly speaking, we should be considering the notion of relative compactness; however, 
for Banach spaces, this coincides with precompactness, and we can disregard these ramifi- 
cations. 
The dyadic entropy numbers of an operator are defined as

$$ e_n(T) := \epsilon_{2^{n-1}}(T), \quad n \in \mathbb{N}; \tag{12.134} $$

similarly, the dyadic entropy numbers of a set are defined from its entropy numbers.
A beautiful introduction to entropy numbers of operators is given in a book
by Carl and Stephani [90].
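
A one-dimensional example makes the inverse relation between covering and entropy numbers concrete. For the unit interval [0, 1] on the real line, n balls of radius $\epsilon$ cover the interval exactly when $2n\epsilon \ge 1$; the sketch below encodes this elementary case, with function names that are illustrative assumptions.

```python
import math

def covering_number_interval(eps):
    """Smallest number of closed eps-balls (intervals of length 2 * eps) covering [0, 1]."""
    return max(1, math.ceil(1.0 / (2.0 * eps)))

def entropy_number_interval(n):
    """epsilon_n([0, 1]): smallest radius for which n balls suffice, namely 1 / (2n)."""
    return 1.0 / (2.0 * n)

def dyadic_entropy_number_interval(n):
    """e_n := epsilon_{2^(n-1)}, cf. (12.134)."""
    return entropy_number_interval(2 ** (n - 1))

dyadic = [dyadic_entropy_number_interval(n) for n in range(1, 6)]
```

Here the dyadic entropy numbers are $e_n = 2^{-n}$: each additional bit used to index the cover halves the achievable precision.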


12.4.3 Generalization Bounds via Uniform Convergence 


Recall the reasoning of Section 5.5. There we explained how uniform convergence
bounds in terms of the covering number could be obtained. For the sake of
concreteness, we quote a result suitable for regression, which was proved in [6]. Let
$E_m[f] := \frac{1}{m} \sum_{i=1}^m f(x_i)$ denote the empirical mean of $f$ on the sample $x_1, \ldots, x_m$.


Lemma 12.38 (Alon, Ben-David, Cesa-Bianchi, and Haussler, 1997) Let $\mathcal{F}$ be a class
of functions from $\mathcal{X}$ into $[0,1]$. For all $\epsilon > 0$, and all $m \ge 2/\epsilon^2$,

$$P\Big\{\sup_{f \in \mathcal{F}} |E_m[f] - E[f]| > \epsilon\Big\} \le 12m \cdot E\Big[N\Big(\frac{\epsilon}{6}, \mathcal{F}, \ell_\infty^{\bar{X}}\Big)\Big] \exp\Big(-\frac{\epsilon^2 m}{36}\Big), \qquad (12.135)$$

where the $P$ on the left hand side denotes the probability w.r.t. the sample $x_1, \ldots, x_m$
drawn iid from the underlying distribution, and $E$ the expectation w.r.t. a second sample
$\bar{X} = (x'_1, \ldots, x'_{2m})$, also drawn iid from the underlying distribution.


In order to use this lemma one usually makes use of the fact that, for any $P$,

$$E_{\bar{X}}\big[N(\epsilon, \mathcal{F}, \ell_\infty^{\bar{X}})\big] \le N^m(\epsilon, \mathcal{F}). \qquad (12.136)$$

An alternative is to exploit the fact that $N(\epsilon, \mathcal{F}, \ell_\infty^{\bar{X}})$ is a concentrated random
variable and measure $N$ on the actual training set. See [66, 293] for further details
on this subject. Lemma 12.38 in conjunction with (12.136) can be used to give a
generalization error result by applying it to the loss-function induced class. The
connection is made by the following lemma:


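Lemma 12.39 (Lipschitz-Continuous Loss [606, 14]) Denote by $\mathcal{F}$ a set of functions
from $\mathcal{X}$ to $[a,b]$, with $a < b$, $a, b \in \mathbb{R} \cup \{\pm\infty\}$, and by $l: \mathbb{R} \to \mathbb{R}_0^+$ a loss function satisfying
a Lipschitz condition

$$|l(\xi) - l(\xi')| \le C|\xi - \xi'| \text{ for all } \xi, \xi' \in [a - b, b - a]. \qquad (12.137)$$

Moreover, let $Z := (x_i, y_i)_{i=1}^m$, $l_f|_Z(j) := l(f(x_j) - y_j)$, $l_f|_Z := (l_f|_Z(j))_{j=1}^m$, $l_{\mathcal{F}}|_Z := \{l_f|_Z : f \in \mathcal{F}\}$
and $N(\epsilon, l_{\mathcal{F}}|_Z) := N(\epsilon, l_{\mathcal{F}}|_Z, \ell_\infty^m)$. Then the following inequality holds:

$$\max_{Z \in (\mathcal{X} \times [a,b])^m} N(\epsilon, l_{\mathcal{F}}|_Z) \le \max_{X \in \mathcal{X}^m} N\Big(\frac{\epsilon}{C}, \mathcal{F}, \ell_\infty^{X}\Big). \qquad (12.138)$$

The proof works by explicitly exploiting the Lipschitz property of the loss. Applying
this result to polynomial loss leads to the following corollary.

Corollary 12.40 (Polynomial Loss) Let the assumptions be as in Lemma 12.39. Then,
for loss functions of type

$$l(\eta) = \frac{1}{p}\eta^p \text{ with } p > 1, \qquad (12.139)$$

we have $C = (b - a)^{(p-1)}$, in particular $C = (b - a)$ for $p = 2$ and, therefore,

$$\max_{Z \in (\mathcal{X} \times [a,b])^m} N(\epsilon, l_{\mathcal{F}}|_Z) \le \max_{X \in \mathcal{X}^m} N\Big(\frac{\epsilon}{(b-a)^{p-1}}, \mathcal{F}, \ell_\infty^{X}\Big). \qquad (12.140)$$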


We can readily combine the uniform convergence results with the above results to 
get overall bounds on generalization performance. We do not explicitly state such 
a result here, since the particular uniform convergence result needed depends on 
the exact setup of the learning problem. In summary, a typical uniform convergence
result (see for instance (5.35) or (12.135)) takes the form

$$P\Big\{\sup_{f \in \mathcal{F}} |R_{\mathrm{emp}}(f) - R(f)| > \epsilon\Big\} \le c_1(m) N^m(\epsilon, \mathcal{F}) e^{-\epsilon^\beta m / c_2}. \qquad (12.141)$$

Even the exponent in (12.141) depends on the setting.⁹ Since our primary interest
is in determining $N^m(\epsilon, \mathcal{F})$ we will not try to summarize the large body of work
now done on uniform convergence results and generalization error.

These generalization bounds are typically used by setting the right hand side
equal to $\delta$ and solving for $m = m(\epsilon, \delta)$ (which is called the sample complexity).
Another way to use these results is as a learning curve bound $\bar{\epsilon}(\delta, m)$, where

$$P\Big\{\sup_{f \in \mathcal{F}} |R_{\mathrm{emp}}(f) - R(f)| > \bar{\epsilon}(\delta, m)\Big\} \le \delta. \qquad (12.142)$$
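We note here that the determination of $\bar{\epsilon}(\delta, m)$ is quite convenient in terms of $e_n$, the
dyadic entropy number associated with the covering number $N^m(\epsilon, \mathcal{F})$ in (12.141).
Setting the right hand side of (12.141) equal to $\delta$, we have

$$\delta = c_1(m) N^m(\epsilon, \mathcal{F}) e^{-\epsilon^\beta m / c_2} \;\Longrightarrow\; \log_2\Big(\frac{\delta}{c_1(m)}\Big) + \frac{\epsilon^\beta m}{c_2} \log_2 e = \log_2 N^m(\epsilon, \mathcal{F}). \qquad (12.143)$$

Eq. (12.143) is satisfied if we can find some $\epsilon$ such that

$$e_{\lfloor \log_2(\delta / c_1(m)) + (\epsilon^\beta m / c_2) \log_2 e + 1 \rfloor} \le \epsilon \qquad (12.144)$$

holds. Clearly we want the minimal $\epsilon$ that satisfies (12.144), since $\epsilon$ determines the
tightness of the bound (12.141). Therefore we define

$$\bar{\epsilon}(\delta, m) := \min\{\epsilon \mid \text{(12.144) holds}\}. \qquad (12.145)$$

Hence the use of $\epsilon_n$ or $e_n$ (which will arise naturally from our techniques) is, in
fact, a convenient thing to do to find learning curves.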
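The learning curve can also be found numerically once the ingredients of (12.141) are fixed. The sketch below is only an illustration: the constants `c1(m) = 4m`, `c2 = 8`, `beta = 2` and the covering number model $\ln N(\epsilon) = d \ln(1/\epsilon)$ are hypothetical placeholders, not values taken from the text. It searches by bisection for the smallest $\epsilon$ at which the right hand side of (12.141) drops to $\delta$, which is valid because that right hand side is nonincreasing in $\epsilon$.

```python
import math


def log_bound(eps, m, log_cover, c1=lambda m: 4.0 * m, c2=8.0, beta=2.0):
    """Natural log of c1(m) * N(eps) * exp(-eps^beta * m / c2), cf. (12.141).
    All constants are illustrative assumptions."""
    return math.log(c1(m)) + log_cover(eps) - (eps ** beta) * m / c2


def learning_curve(delta, m, log_cover, lo=1e-9, hi=1.0, iters=100):
    """Smallest eps in (lo, hi] with bound <= delta, or None if the bound is
    still above delta at eps = hi (i.e. vacuous on the search interval)."""
    target = math.log(delta)
    if log_bound(hi, m, log_cover) > target:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_bound(mid, m, log_cover) <= target:
            hi = mid  # mid is feasible, tighten from above
        else:
            lo = mid
    return hi


def poly_log_cover(eps, d=10.0):
    """Hypothetical covering number model: ln N(eps) = d * ln(1/eps)."""
    return d * math.log(1.0 / eps)


eps_small_sample = learning_curve(0.05, 1000, poly_log_cover)
eps_large_sample = learning_curve(0.05, 10000, poly_log_cover)
```

With these placeholder constants the value of $\bar{\epsilon}(\delta, m)$ shrinks as $m$ grows, which is the qualitative behavior a learning curve bound is meant to capture.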


9. In regression $\beta$ can be set to 1; however, in agnostic learning Kearns et al. [287] show
that, in general, $\beta = 2$, except if the class is convex, in which case it can be set to 1 [322].

The key idea in the present section concerns the manner in which the covering
numbers are computed. Traditionally, appeal has been made to the
Sauer-Shelah-Vapnik-Chervonenkis Lemma (originally due to [567] and rediscovered in [493,
458]). In the case of function learning, a generalization due to Pollard (called the
pseudo-dimension), or Vapnik and Chervonenkis (called the VC dimension of real
valued functions, see Section 5.5.6), or a scale-sensitive generalization of that (the
fat-shattering dimension) is used to bound the covering numbers. These results
reduce the computation of $N^m(\epsilon, \mathcal{F})$ to the computation of a single
"dimension-like" quantity. An overview of various dimensions, some details of their history,
and some examples of their computation can be found in [13].


12.4.4 Entropy Numbers for Kernel Machines 


The derivation of bounds on the covering number (and entropy number) of $\mathcal{F}$
proceeds by making statements about the shape of the image of the input space $\mathcal{X}$
under the feature map $\Phi$. We make use of Mercer's theorem (Theorem 2.10) and
of the scaling operator constructed in Section 2.2.5.

Recall that in Proposition 2.13, where we described valid scaling operators
that map $\Phi(\mathcal{X})$ into $\ell_2$, the numbers $l_i$ are related to the eigenvalues according
to (2.50). Following (2.50), it was pointed out that for some common kernels, it
is not necessary to distinguish between $l_i$ and $\lambda_i$. In the present section, we will
formulate the results for the $l_i$; however, the reader may bear in mind that these
are essentially determined by the $\lambda_i$.

In the following (without loss of generality) we assume the sequence $(l_i)_i$ (cf.
(2.50)) is sorted in nonincreasing order.

As discussed in Section 2.2.5, the rate of decay of the eigenvalues has implica- 
tions for the area occupied by the data in feature space. 

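As a consequence of Proposition 2.13, we can construct a mapping $A$ from the
unit ball in $\ell_2$ to an ellipsoid $\mathcal{E}$ such that $\Phi(\mathcal{X}) \subset \mathcal{E}$, as in the following diagram:

$$\begin{array}{ccccc} \mathcal{X} & \xrightarrow{\;\Phi\;} & \Phi(\mathcal{X}) & \xrightarrow{\;A^{-1}\;} & U_{\ell_2} \\ & & \cap & \swarrow_{A} & \\ & & \mathcal{E} & & \end{array} \qquad (12.146)$$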


The operator A will be useful for computing the entropy numbers of concatena- 
tions of operators. (Knowing the inverse will allow us to compute the forward 
operator, and that can be used to bound the covering numbers of the class of func- 
tions, as shown in the next subsection.) 

Define

$$R := \big\|(s_j \sqrt{l_j})_j\big\|_{\ell_2}. \qquad (12.147)$$

From Proposition 2.13 it is clear that we may use

$$A := R S^{-1} = \big\|(s_j \sqrt{l_j})_j\big\|_{\ell_2} S^{-1}, \qquad (12.148)$$

where $S$ denotes the diagonal scaling operator with entries $s_j$ from Proposition 2.13.
We call such scaling (inverse) operators admissible. The next step is to compute the
entropy numbers of the operator $A$ and use this to obtain bounds on the entropy
numbers for kernel machines such as SVMs. We make use of the following theorem
(see [208, p. 226], stated in the given form in [90, p. 17]).


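Theorem 12.41 (Diagonal Scaling Operators) Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_j \ge \cdots \ge 0$ be a
non-increasing sequence of non-negative numbers and let

$$Dx = (\sigma_1 x_1, \sigma_2 x_2, \ldots, \sigma_j x_j, \ldots) \qquad (12.149)$$

for $x = (x_1, x_2, \ldots, x_j, \ldots) \in \ell_p$ be the diagonal operator from $\ell_p$ into itself, generated by
the sequence $(\sigma_j)_j$, where $1 \le p \le \infty$. Then, for all $n \in \mathbb{N}$,

$$\sup_{j \in \mathbb{N}} n^{-1/j} (\sigma_1 \sigma_2 \cdots \sigma_j)^{1/j} \le \epsilon_n(D) \le 6 \sup_{j \in \mathbb{N}} n^{-1/j} (\sigma_1 \sigma_2 \cdots \sigma_j)^{1/j}. \qquad (12.150)$$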

We can exploit the freedom in choosing A to minimize an entropy number as 
the following corollary shows. This is a key ingredient in our calculation of the 
covering numbers for SV classes, as shown below. 


Corollary 12.42 (Entropy Numbers for Admissible Scaling Operators) Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
be a Mercer kernel and let the scaling operator $A$ be defined by (12.148) and $R$ by
(12.147), with $(\sqrt{l_j}\, s_j)_j \in \ell_2$. Then

$$\epsilon_n(A: \ell_2 \to \ell_2) \le \sup_{j \in \mathbb{N}} 6R\, (n\, s_1 s_2 \cdots s_j)^{-1/j}. \qquad (12.151)$$

This result follows immediately from the identification of $D$ and $A$. We can optimize
(12.151) by exploiting the freedom that we still have in choosing a particular
operator $A$ among the class of admissible ones. This leads to the following result
(the infimum is in fact attainable [220]).


Corollary 12.43 (Entropy Numbers for Optimal Scaling) There exists an $A$ defined
by (12.148) and $R$ defined in (12.147) that satisfies

$$\epsilon_n(A: \ell_2 \to \ell_2) \le \inf_{(s_j)_j :\, (\sqrt{l_j}\, s_j)_j \in \ell_2} \; \sup_{j \in \mathbb{N}} 6R\, (n\, s_1 s_2 \cdots s_j)^{-1/j}. \qquad (12.152)$$


The functions that an SV machine generates can be expressed as $x \mapsto \langle w, \Phi(x) \rangle + b$,
where $w, \Phi(x) \in \mathcal{H}$ and $b \in \mathbb{R}$. The "+b" term is dealt with in [606]; for now we
consider the simplified class

$$\mathcal{F}_\Lambda := \{x \mapsto \langle w, \Phi(x) \rangle \mid x \in \mathcal{X}, \|w\| \le \Lambda\} \subseteq \mathbb{R}^{\mathcal{X}}. \qquad (12.153)$$

What we seek are the $\ell_\infty^m$ covering numbers for the class $\mathcal{F}_\Lambda$ induced by the
kernel in terms of the parameter $\Lambda$. As described in Chapter 7, this is the inverse
of the size of the margin in feature space, or, equivalently, the size of the weight
vector in feature space as defined by the dot product in $\mathcal{H}$. We call such hypothesis
classes with a length constraint on the weight vectors in feature space SV classes.
Let $T$ be the operator $T = S_{\Phi(X)} \Lambda$ where $\Lambda \in \mathbb{R}^+$, and define the operator $S_{\Phi(X)}$ by

$$S_{\Phi(X)}: \ell_2 \to \ell_\infty^m, \qquad S_{\Phi(X)}: w \mapsto (\langle \Phi(x_1), w \rangle, \ldots, \langle \Phi(x_m), w \rangle). \qquad (12.154)$$

The following theorem is useful in computing entropy numbers in terms of $T$
and $A$. Originally due to Maurey, it was extended in [89]. See [605] for further
extensions and historical remarks.


Theorem 12.44 (Maurey's Bound [90]) Let $m \in \mathbb{N}$ and $S \in \mathfrak{L}(\mathcal{H}, \ell_\infty^m)$, where $\mathcal{H}$ is a
Hilbert space. Then there exists a constant $c > 0$ such that, for all $m, n \in \mathbb{N}$,

$$e_n(S) \le c \|S\| \Big(n^{-1} \log_2\Big(1 + \frac{m}{n}\Big)\Big)^{1/2}. \qquad (12.155)$$

An alternative proof of this result (given in [605]) provides a small explicit value
for the constant: $c \le 103$.

The restatement of Theorem 12.44 in terms of $\epsilon_{2^{n-1}} = e_n$ will be useful in the
following. Under the assumptions given we have

$$\epsilon_n(S) \le c \|S\| \Big((\log_2 n)^{-1} \log_2\Big(1 + \frac{m}{\log_2 n}\Big)\Big)^{1/2} \text{ for } n > 1. \qquad (12.156)$$


Now we can combine the bounds on entropy numbers of A and Sgx) to obtain 
bounds for SV classes. First we need the following lemma. 


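Lemma 12.45 (Product Bound [90]) Let $E, F, G$ be Banach spaces, $R \in \mathfrak{L}(F, G)$, and
$S \in \mathfrak{L}(E, F)$. Then, for $n, t \in \mathbb{N}$,

$$\epsilon_{nt}(RS) \le \epsilon_n(R)\, \epsilon_t(S), \qquad (12.157)$$
$$\epsilon_n(RS) \le \epsilon_n(R)\, \|S\|, \qquad (12.158)$$
$$\epsilon_n(RS) \le \epsilon_n(S)\, \|R\|. \qquad (12.159)$$

Note that the latter two inequalities follow directly from (12.157) and the fact that $\epsilon_1(R) = \|R\|$
for all $R \in \mathfrak{L}(F, G)$.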


Theorem 12.46 (Bounds for SV classes) Let $k$ be a Mercer kernel, let $\Phi$ be induced
via (2.40) and let $T := S_{\Phi(X)} \Lambda$ where $S_{\Phi(X)}$ is given by (12.154) and $\Lambda \in \mathbb{R}^+$. Let $A$ be
defined by (12.148). Then the entropy numbers of $T$ satisfy the following inequalities, for
$n > 1$:

$$\epsilon_n(T) \le c \|A\| \Lambda \log_2^{-1/2} n \, \log_2^{1/2}\Big(1 + \frac{m}{\log_2 n}\Big), \qquad (12.160)$$
$$\epsilon_n(T) \le 6 \Lambda \epsilon_n(A), \qquad (12.161)$$
$$\epsilon_{nt}(T) \le 6 c \Lambda \log_2^{-1/2} n \, \log_2^{1/2}\Big(1 + \frac{m}{\log_2 n}\Big) \epsilon_t(A), \qquad (12.162)$$

where $c$ is defined as in Theorem 12.44.

This result gives several options for bounding $\epsilon_n(T)$. We shall see in examples later
that the best inequality to use depends on the rate of decay of the eigenvalues of $k$.
The result gives effective bounds on $N^m(\epsilon, \mathcal{F}_\Lambda)$ since

$$\epsilon_n(T: \ell_2 \to \ell_\infty^m) \le \epsilon_0 \;\Longrightarrow\; N^m(\epsilon_0, \mathcal{F}_\Lambda) \le n. \qquad (12.163)$$


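Proof We use the following factorization of $T$ to upper bound $\epsilon_n(T)$.

$$\begin{array}{ccc} U_{\ell_2} \subset \ell_2 & \xrightarrow{\;T\;} & \ell_\infty^m \\ \downarrow \Lambda & \nearrow S_{\Phi(X)} & \uparrow S_{(A^{-1}\Phi(X))} \\ \Lambda U_{\ell_2} \subset \ell_2 & \xrightarrow{\;A\;} & \Lambda \mathcal{E} \subset \ell_2 \end{array} \qquad (12.164)$$

The top left part of the diagram follows from the definition of $T$. The fact that the
diagram commutes stems from the fact that, since $A$ is diagonal, it is self-adjoint
and hence for any $x \in \mathcal{X}$,

$$\langle w, \Phi(x) \rangle = \langle w, A A^{-1} \Phi(x) \rangle = \langle A w, A^{-1} \Phi(x) \rangle. \qquad (12.165)$$

Instead of computing the covering number of $T = S_{\Phi(X)} \Lambda$ directly, which is difficult
or wasteful, as the bound on $S_{\Phi(X)}$ does not take into account that $\Phi(x) \in \mathcal{E}$ but
just makes the assumption of $\Phi(x) \in \rho U_{\ell_2}$ for some $\rho > 0$, we will represent $T$ as
$S_{(A^{-1}\Phi(X))} A \Lambda$. This is more efficient as we construct $A$ such that $A^{-1}\Phi(\mathcal{X}) \subset U_{\ell_2}$ fills
a larger proportion of it than just $\frac{1}{\rho}\Phi(\mathcal{X})$.

By construction of $A$, and due to the Cauchy-Schwarz inequality, we know that
$\|S_{(A^{-1}\Phi(X))}\| \le 1$. Thus, applying Lemma 12.45 to the factorization of $T$, and using
Theorem 12.44 proves the theorem. ∎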
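The three inequalities of Theorem 12.46 can be compared numerically once a model for $\epsilon_n(A)$ is available. The sketch below is an illustration under stated assumptions: `eps_A` is a hypothetical decay model supplied by the user, the constant `c = 103` is the explicit value quoted after Theorem 12.44, and the function names are invented for this example. For (12.162) it uses the fact that entropy numbers are nonincreasing in the index, so any split with $n_1 t \le n$ also bounds $\epsilon_n(T)$.

```python
import math

C_MAUREY = 103.0  # explicit constant quoted after Theorem 12.44


def maurey_factor(n, m, c=C_MAUREY):
    """c * log2^{-1/2}(n) * log2^{1/2}(1 + m / log2 n), valid for n > 1."""
    log_n = math.log2(n)
    return c * math.sqrt(math.log2(1.0 + m / log_n) / log_n)


def sv_class_bounds(n, m, lam, eps_A):
    """Evaluate (12.160), (12.161) and the best split of (12.162) for eps_n(T).
    `eps_A(n)` is an assumed model of the entropy numbers of A."""
    norm_A = eps_A(1)  # eps_1(A) = ||A||
    b160 = norm_A * lam * maurey_factor(n, m)
    b161 = 6.0 * lam * eps_A(n)
    # try every split n1 * t <= n with n1 > 1 and t = n // n1 >= 1
    b162 = min(6.0 * lam * maurey_factor(n1, m) * eps_A(n // n1)
               for n1 in range(2, n + 1))
    return b160, b161, b162


def toy_eps_A(n):
    """Hypothetical decay model with ||A|| = 1 (illustration only)."""
    return math.exp(-math.sqrt(math.log(n)))


b160, b161, b162 = sv_class_bounds(1024, 100, 1.0, toy_eps_A)
best = min(b160, b161, b162)
```

Which of the three values is smallest depends on how fast `eps_A` decays relative to the Maurey factor, in line with the remark that the best inequality depends on the eigenvalue decay of the kernel.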


As we see below, we can give asymptotic rates of decay for $\epsilon_n(A)$. (In fact we
give non-asymptotic results with explicitly evaluable constants.) It is thus of some
interest to give overall asymptotic rates of decay of $\epsilon_n(T)$ in terms of the order
of $\epsilon_n(A)$. By "asymptotic" here we mean asymptotic in $n$; this corresponds to
asking how $N(\epsilon, \mathcal{F})$ scales as $\epsilon \to 0$ for fixed $m$.


Lemma 12.47 (Rate bounds on $\epsilon_n$) Let $k$ be a Mercer kernel and suppose $A$ is the
scaling operator associated with it, as defined by (12.148).

1. If $\epsilon_n(A) = O(\log_2^{-\alpha} n)$ for some $\alpha > 0$ then for fixed $m$

$$\epsilon_n(T) = O(\log_2^{-(\alpha + 1/2)} n). \qquad (12.166)$$

2. If $\log_2 \epsilon_n(A) = O(\log_2^{\beta} n)$ for some $\beta > 0$ then for fixed $m$

$$\log_2 \epsilon_n(T) = O(\log_2^{\beta} n). \qquad (12.167)$$

This lemma shows that, in the first case, Maurey's result (Theorem 12.44) allows an
improvement in the exponent of the entropy number of $T$, whereas in the second,
it affords none (since the entropy numbers decay so fast anyway). The Maurey
result may still help in that case for nonasymptotic $n$. Note that for simplicity of
notation we have omitted the dependency of the bounds on $m$. See, e.g.,
[526, 606] for further details.


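Proof From Theorem 12.44 we know that $\epsilon_n(S) = O(\log_2^{-1/2} n)$. Now use (12.157),
splitting the index $n$ in the following way:

$$n = n^{\tau} n^{1 - \tau} \text{ with } \tau \in (0, 1). \qquad (12.168)$$

For the first case this yields

$$\epsilon_n(T) = O(\log_2^{-1/2} n^{\tau})\, O(\log_2^{-\alpha} n^{1-\tau}) = \tau^{-1/2} (1 - \tau)^{-\alpha}\, O(\log_2^{-(\alpha + 1/2)} n). \qquad (12.169)$$

In the second case we have

$$\log_2 \epsilon_n(T) = \log_2\big(\tau^{-1/2} O(\log_2^{-1/2} n)\big) + (1 - \tau)^{\beta}\, O(\log_2^{\beta} n) = O(\log_2^{\beta} n). \qquad (12.170)$$
∎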


In a nutshell we can always obtain rates of convergence better than those due 
to Maurey’s theorem, because we are not dealing with arbitrary mappings into 
infinite dimensional spaces. In fact, for logarithmic dependency of $\epsilon_n(T)$ on $n$, the
effect of the kernel is so strong that it completely dominates the $n^{-1/2}$ behavior for
arbitrary Hilbert spaces. An example of such a kernel is $k(x, y) = \exp(-(x - y)^2)$;
see Proposition 12.51 and also Section 12.4.5 for the discretization question.


12.4.5 Discrete Spectra of Convolution Operators 


The results presented above show that if we know the eigenvalue sequence of a
compact operator, we can bound its entropy numbers. Whilst it is always possible
to assume that the data fed into an SV machine has bounded support, the same
cannot be said of the kernel $k(\cdot, \cdot)$. A commonly used kernel is $k(x, y) = \exp(-(x - y)^2)$,
which has noncompact support. The induced integral operator

$$(T_k f)(x) := \int_{-\infty}^{\infty} k(x, y) f(y)\, dy \qquad (12.171)$$

then has a continuous spectrum and is not compact [17, p.267]. The question arises
as to whether we can make use of such kernels in SVMs and still obtain generalization
error bounds of the form developed above. A further motivation stems from the
fact that, by a theorem of Widom [595], the eigenvalue decay of any convolution
operator defined on a compact set via a kernel having compact support can
be no faster than $\lambda_j = O(e^{-j^2})$. Thus, if we seek very rapid decay of eigenvalues
(with concomitantly small entropy numbers), we must use convolution kernels
with noncompact support.

We will resolve these issues in this section. Before doing so, let us first consider
the case that $\mathrm{supp}\, k \subseteq [-a, a]$ for some $a < \infty$. Suppose, further, that the data points
$x_j$ satisfy $x_j \in [-b, b]$ for all $j$. If $k(\cdot, \cdot)$ is a convolution kernel ($k(x, y) = k(x - y)$),
then the SV hypothesis $h_k(\cdot)$ can be written

$$h_k(x) = \sum_{j=1}^m \alpha_j k(x, x_j) = \sum_{j=1}^m \alpha_j k_p(x, x_j) =: h_{k_p}(x) \qquad (12.172)$$

for $p \ge 2(a + b)$, where $k_p(\cdot)$ is the $p$-periodic extension of $k(\cdot)$, as given by (4.42):

$$k_p(x, x') := \sum_{j=-\infty}^{\infty} k(x - jp, x'). \qquad (12.173)$$

Here we used $k(x, x') = k(x - x')$. We now relate the eigenvalues of $T_{k_p}$ to the
Fourier transform of $k(\cdot)$. The following lemma is a direct consequence of
Proposition 4.12 (all we need to do is replace $2\pi$ by the new period $p$).


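Lemma 12.48 (Discrete Spectrum of Periodic Kernels) Let $k: \mathbb{R} \to \mathbb{R}$ be a symmetric
convolution kernel, let $K(\omega) = F[k(x)](\omega)$ denote the Fourier transform of $k(\cdot)$ and $k_p$
denote the $p$-periodical kernel derived from $k$ (assume also that $k_p$ exists). Then $k_p$ has a
representation as a Fourier series with $\omega_0 := \frac{2\pi}{p}$ and

$$k_p(x - y) = \sum_{j=-\infty}^{\infty} \frac{\sqrt{2\pi}}{p} K(j\omega_0) e^{i j \omega_0 (x - y)} = \frac{\sqrt{2\pi}}{p} K(0) + \sum_{j=1}^{\infty} \frac{2\sqrt{2\pi}}{p} K(j\omega_0) \cos(j\omega_0 (x - y)). \qquad (12.174)$$

Moreover, $\lambda_j = \sqrt{2\pi} K(j\omega_0)$ for $j \in \mathbb{Z}$ and $C_k = \sqrt{2/p}$ (see (2.51) for the definition of $C_k$).
Finally, for $k: \mathbb{R}^N \to \mathbb{R}$ and a $p$-periodic kernel $k_p$ in each direction ($x = (x_1, \ldots, x_N)$)
(derived from $k$), we get the following spectrum $\lambda_j$ of $k_p$:

$$\lambda_j = (2\pi)^{N/2} K(\omega_0 j) = (2\pi)^{N/2} K(\omega_0 \|j\|) \text{ where } C_k = (2/p)^{N/2}. \qquad (12.175)$$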
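The Fourier series (12.174) can be checked numerically for a concrete kernel. The sketch below compares the direct periodization (12.173) with the cosine series for the Gaussian $k(x) = e^{-x^2}$. It assumes the unitary Fourier convention $K(\omega) = (2\pi)^{-1/2} \int k(x) e^{-i\omega x} dx$, under which $K(\omega) = e^{-\omega^2/4}/\sqrt{2}$; this convention is an assumption made here because it is the one consistent with the factor $\sqrt{2\pi}/p$. Both sums are truncated, and the function names are hypothetical.

```python
import math


def k(x):
    """Gaussian convolution kernel k(x) = exp(-x^2)."""
    return math.exp(-x * x)


def K(w):
    """Fourier transform of k under the unitary convention (assumed)."""
    return math.exp(-w * w / 4.0) / math.sqrt(2.0)


def k_p_direct(x, p, terms=50):
    """Truncated periodization sum, cf. (12.173)."""
    return sum(k(x - j * p) for j in range(-terms, terms + 1))


def k_p_fourier(x, p, terms=50):
    """Truncated cosine series, cf. (12.174)."""
    w0 = 2.0 * math.pi / p
    c = math.sqrt(2.0 * math.pi) / p
    return c * K(0.0) + sum(2.0 * c * K(j * w0) * math.cos(j * w0 * x)
                            for j in range(1, terms + 1))


p = 3.0
gaps = [abs(k_p_direct(x, p) - k_p_fourier(x, p)) for x in (0.0, 0.7, 1.5)]
```

For the Gaussian both series converge extremely fast, so the two evaluations agree to within floating point accuracy.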


Thus, even though $T_k$ may not be compact, $T_{k_p}$ may be (if $(K(j\omega_0))_{j \in \mathbb{N}} \in \ell_2$ for
example). The above lemma can be applied whenever we can form $k_p(\cdot)$ from
$k(\cdot)$. Clearly $k(x) = O(x^{-(1+\epsilon)})$ for some $\epsilon > 0$ suffices to ensure the sum in (12.173)
converges.

Let us now consider how to choose $p$. Note that the Riemann-Lebesgue lemma
tells us that, for integrable $k(\cdot)$ of bounded variation (surely any kernel we would
use would satisfy that assumption), one has $K(\omega) = O(1/\omega)$. There is a trade-off
in choosing $p$ in that, for large enough $\omega$, $K(\omega)$ is a decreasing function of $\omega$ (at
least as fast as $1/\omega$) and thus, by Lemma 12.48, $\lambda_j = \sqrt{2\pi} K(2\pi j / p)$ is an increasing
function of $p$. This suggests one should choose a small value of $p$. But a small $p$
will lead to high empirical error (as the kernel "wraps around" and its localization
properties are lost) and large $C_k$. There are several approaches to picking a value
of $p$. One obvious one is to a priori pick some $\tilde{\epsilon} > 0$ and choose the smallest $p$ such
that $|k(x) - k_p(x)| \le \tilde{\epsilon}$ for all $x \in [-p/2, p/2]$. Thus one would obtain a hypothesis
$h_{k_p}(x)$ uniformly within $C\tilde{\epsilon}$ of $h_k(x)$, where $\sum_{j=1}^m |\alpha_j| \le C$.
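The rule for picking $p$ can be tried out directly. The following sketch, for the Gaussian $k(x) = e^{-x^2}$, scans candidate periods on a grid and returns the first one whose wrap-around error on $[-p/2, p/2]$ is below a tolerance. The grid resolution, the starting period, the truncation of the periodization sum, and the function names are all illustrative assumptions.

```python
import math


def k(x):
    """Gaussian convolution kernel k(x) = exp(-x^2)."""
    return math.exp(-x * x)


def wrap_error(p, grid=50, terms=5):
    """max over a grid of x in [-p/2, p/2] of |k_p(x) - k(x)|, where
    k_p(x) - k(x) = sum over j != 0 of k(x - j p) (truncated)."""
    worst = 0.0
    for i in range(grid + 1):
        x = -p / 2.0 + i * p / grid
        err = sum(k(x - j * p) for j in range(-terms, terms + 1) if j != 0)
        worst = max(worst, err)
    return worst


def smallest_period(tol, p_min=1.0, step=0.01, max_steps=10000):
    """First period on the grid p_min + i * step with wrap_error(p) <= tol."""
    for i in range(max_steps):
        p = p_min + i * step
        if wrap_error(p) <= tol:
            return p
    return None


p_star = smallest_period(1e-3)
```

For the Gaussian the error is dominated by the neighbouring copy at the interval boundary, roughly $e^{-p^2/4}$, so the tolerance $10^{-3}$ leads to a period slightly above 5.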

Finally, it is worth noting how the choice of a different bandwidth of the
kernel, namely letting $k^{(\sigma)}(x) := \sigma k(\sigma x)$, affects the spectrum of the corresponding
operator. We have $K^{(\sigma)}(\omega) = K(\omega / \sigma)$, hence scaling a kernel by $\sigma$ means more
densely spaced eigenvalues in the spectrum of the integral operator $T_{k^{(\sigma)}}$.

In conclusion: in order to obtain a discrete spectrum, we need to use a periodic 
kernel. For a problem at hand, one can always periodize a nonperiodic kernel in 
a way that changes the estimated function in an arbitrarily small way, hence the 
above results can be applied. 


12.4.6 Covering Numbers for Given Decay Rates 


In this section we will show how the asymptotic behavior of $\epsilon_n(A: \ell_2 \to \ell_2)$, where
$A$ is the scaling operator introduced before, depends on the eigenvalues of $T_k$.

Note that we need to sort the $l_i$ in a nonincreasing manner because of the
requirements in Corollary 12.43. Many one-dimensional kernels have nondegenerate
systems of eigenvalues. In this case it is straightforward to explicitly compute
the geometrical means of the eigenvalues, as will be shown below. Note that whilst
all of the examples which follow are for convolution kernels ($k(x, y) = k(x - y)$),
there is nothing in the formulations of the propositions themselves that requires
this. When we consider the $N$-dimensional case we shall see that, with rotationally
invariant kernels, degenerate systems of eigenvalues are generic. This can be dealt
with by a slight modification of Theorem 12.41; see [606] for details.

Let us consider the special case in which $(l_i)_i$ decays asymptotically with some
polynomial or exponential degree. In this case we can choose a sequence $(s_j)_j$ for
which we can evaluate (12.152) explicitly. By the eigenvalues of a kernel $k$ we mean
the eigenvalues of the induced integral operator $T_k$.


Proposition 12.49 (Polynomial Decay [606]) Let $k$ be a Mercer kernel with $l_j = \beta^2 j^{-(\alpha + 1)}$
for some $\alpha > 0$. Then for any $\delta \in (0, \alpha/2)$ we have

$$\epsilon_n(A: \ell_2 \to \ell_2) = O(\ln^{-\alpha/2 + \delta} n). \qquad (12.176)$$

An example of such a kernel is $k(x) = e^{-x}$.

Proposition 12.50 (Exponential Decay [606]) Suppose $k$ is a Mercer kernel with
$l_j = \beta^2 e^{-\alpha(j-1)}$ for some $\alpha, \beta > 0$. Then

$$\ln \epsilon_n^{-1}(A: \ell_2 \to \ell_2) = O(\ln^{1/2} n). \qquad (12.177)$$

An example of such a kernel is $k(x) = \frac{1}{1 + x^2}$.


Proposition 12.51 (Exponential Quadratic Decay) Suppose $k$ is a Mercer kernel with
$l_j = \beta^2 e^{-\alpha(j-1)^2}$ for some $\alpha, \beta > 0$. Then

$$\ln \epsilon_n^{-1}(A: \ell_2 \to \ell_2) = O(\ln^{2/3} n). \qquad (12.178)$$

An example of such a kernel is the Gaussian $k(x) = e^{-x^2}$.


We conclude this section with a general relation between exponential-polynomial 
decay rates and orders of bounds on €,(A). 


Proposition 12.52 (Exponential-Polynomial decay) Suppose $k$ is a Mercer kernel
with $l_j = \beta^2 e^{-\alpha j^p}$ for some $\alpha, \beta, p > 0$. Then

$$\ln \epsilon_n^{-1}(A: \ell_2 \to \ell_2) = O(\ln^{\frac{p}{p+1}} n). \qquad (12.179)$$


In [606], it is shown that the rates given in Propositions 12.49, 12.50, 12.51, and 
12.52 are tight. These results give the guarantees on the learning rates of estimators 
using such types of kernels, which is theoretically satisfying and leads to desirable 
sample complexity rates. In practice, however, it is often better to take advantage 
of estimates based on an analysis of the distribution of the training data since the 
rates obtained by the latter can be superior [604]. 
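The flavor of these rates can be reproduced with a short calculation. The sketch below is a simplified stand-in rather than the optimized scaling operator of the propositions: it evaluates the diagonal-operator quantity $\sup_j n^{-1/j}(\sigma_1 \cdots \sigma_j)^{1/j}$ of Theorem 12.41 for the assumed sequence $\sigma_j = e^{-\alpha j / 2}$ (exponential decay) and reports $\ln(1/\text{bound})$ as a function of $\ln n$. For this sequence the maximizing index is $j \approx 2\sqrt{\ln n / \alpha}$, which gives $\ln(1/\text{bound}) \approx \sqrt{\alpha \ln n} + \alpha/4$, i.e. growth of order $\ln^{1/2} n$. The function name and the truncation `max_j` are hypothetical.

```python
import math


def log_inv_bound(log_n, alpha, max_j=2000):
    """ln(1 / b_n) with b_n = sup_j n^{-1/j} (sigma_1 ... sigma_j)^{1/j}
    for sigma_j = exp(-alpha * j / 2); takes ln(n) as input to avoid overflow."""
    best = -math.inf
    log_prod = 0.0
    for j in range(1, max_j + 1):
        log_prod += -alpha * j / 2.0          # log(sigma_1 ... sigma_j)
        best = max(best, (log_prod - log_n) / j)
    return -best


# Ratio ln(1/b_n) / sqrt(ln n) should approach sqrt(alpha) as n grows.
ratio_400 = log_inv_bound(400.0, 1.0) / math.sqrt(400.0)
ratio_1600 = log_inv_bound(1600.0, 1.0) / math.sqrt(1600.0)
```

The ratios stay close to $\sqrt{\alpha} = 1$ and move closer as $\ln n$ increases, illustrating the square-root-logarithmic growth associated with exponentially decaying sequences.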


12.5 Summary 


In this chapter we studied four alternative ways of assessing the quality of an 
estimator, none of which relies on the VC dimension as the basic mechanism. 

The first method was based on the concept of algorithmic stability, i.e., on the 
fact that the estimates we obtain by minimizing the regularized risk functional do 
not depend to a large extent on individual instances. This facilitated the applica- 
tion of concentration of measure inequalities and ultimately the proof of uniform 
convergence bounds. 

A second strategy relied on leave-one-out estimators, which provide an (almost) 
unbiased, yet somewhat noisy estimate of the expected error of the estimation 
algorithm. In some cases, however, such as the minimization of regularized risk 
functionals, it is possible to derive upper bounds on the variance of the leave- 
one-out estimate. In this context we presented three means of computing such 
estimates, based on ideas from statistical mechanics and optimization theory. 

Thirdly, Bayesian-like concepts can also be employed in the assessment of an
estimator. In this context, we introduced the notion of the risk of a distribution
over hypotheses rather than a single function. Using a connection between the
width of the margin and the posterior weight we established bounds which are
predictive even for small sample sizes but which are oblivious to the
shape of the data mapped into feature space.

Finally, we gave a brief overview of a capacity concept based on the metric
entropy of function spaces which crucially relies on the shape of the data in feature
space. This enabled us to give significantly improved bounds on the capacity of
function classes derived from specific kernels, by combining concepts from the
theory of Banach spaces and functional analysis.

Only time and future research will tell whether we may be able to establish a
master concept which encompasses all these different facets of bounds on the
generalization performance of estimators. It seems to be a rewarding and promising
avenue for future work.


12.6 Problems


12.1 (Uniform Convergence Bounds for SVM •) Prove a uniform convergence
statement for Support Vector Machines using (12.5) and Theorem 12.4.


12.2 (Adaptive Margin SVM with the $\nu$-Property ••) An alternative to (12.43) is to
use the $\nu$-trick (see Section 3.4.3) to modify the optimization problem (12.42). Prove that

$$\begin{array}{ll} \displaystyle\mathop{\text{minimize}}_{\alpha, \xi, \rho} & \displaystyle\frac{1}{m} \sum_{i=1}^m \xi_i - \nu\rho \\ \text{subject to} & \displaystyle y_i \sum_{j \ne i} \alpha_j y_j k(x_i, x_j) \ge \rho - \xi_i \text{ for all } i \in [m] \\ & \alpha_i, \xi_i, \rho \ge 0 \text{ for all } i \in [m] \end{array} \qquad (12.180)$$

has the $\nu$-property, that is, that at most a fraction of $1 - \nu$ patterns are classified correctly
with margin larger than $\rho$ and that at most a fraction of $\nu$ patterns are margin errors.


12.3 (Span Bound for $\nu$-Classification •••) Prove an analog to the Span Bound
(Theorem 12.14) in the case of $\nu$-Classification. Hint: you will have to distinguish between the
case of $n^* = [\nu m]$ and $n^* > [\nu m]$, which will lead to the introduction of Span and Swap
of a SV solution as in the case of quantile estimation (see Section 12.2.4).


12.4 (Leave-One-Out Approximation for $\nu$-SVM •••) Using the techniques from
Section 12.2.5 derive an approximation of the leave-one-out error in the case of $\nu$-SVM.
As opposed to the standard SV setting you also have to take possible changes in the width
of the margin into account. Hint: assume that in-bound SVs will remain so and adapt the
Lagrange multipliers œ; accordingly ul that Lil aj = 1 is satisfied (note: do not rescale 
the bound constrained multipliers from + z to zaz): Simultaneously adapt the margin p 


such that the in-bound SVs still remain in- -bound SVs. In other words choose dp and da; 
such that, for all in-bound SVs, Xiz 0ajk(x;, Xj) = ajk(xj, xj) + dp. 


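12.5 (Counterexample for $R[f_{\mathrm{Gibbs}}] \le C R[f_{\mathrm{Bayes}}]$ ••) Show that there exist cases where
$R[f_{\mathrm{Bayes}}] = 0$ and $R[f_{\mathrm{Gibbs}}] > \frac{1}{2} - \epsilon$ for any $\epsilon > 0$. Hint: consider a posterior distribution
$p(f|Z)$ where $\frac{1}{2} > p(f(x) \ne y | Z, x) > \frac{1}{2} - \epsilon$.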


12.6 (Error of the Gibbs Classifier •) Assume that we know the distribution $P(x, y)$
according to which data are generated. What is the error that the Gibbs classifier $f_{\mathrm{Gibbs}}$
will achieve? Prove that it is always greater than or equal to the error of the Bayes classifier.
Can you construct a case where the Gibbs classifier has smaller error than the Bayes
classifier (in the case where posterior distribution and true distribution disagree)?


12.7 (PAC-Bayesian Bound for Nonzero Loss [353] •••) Prove (12.113) by following
the steps of the proof of (12.114). Step 1: derive a suitable w(f , Z, ô) by using ipa ae 
bound. Step 2: apply the Quantifier Reversal Lemma with a = PG’) and 3 = + and recall 
that the loss incurred by classification is bounded by 1. 


III


KERNEL METHODS 


Nur droben, wo die Sterne, 
Gibt’s Kirschen ohne Kerne. 
H. Heine¹


We have previously argued that one of the merits of SV algorithms is that they 
draw the attention of the machine learning community to the practical usefulness 
of statistical learning theory. Whilst the mathematical formalization that this has 
brought about is certainly beneficial to the field of statistical machine learning, 
it is by no means the only merit. The second advantage, potentially even more 
far-reaching, is that SV methods popularize the use of positive definite kernels 
for learning. The use of such kernels is not limited to SVMs, and a number of 
interesting and useful algorithms also benefit from the “kernel trick.” 

The third and final part of the present book therefore focuses on kernels. This 
is done in two ways. First, by presenting some non-SV algorithms utilizing kernel 
functions, starting with Kernel PCA (historically the first such algorithm), and effi- 
cient modifications thereof (Chapter 14). We then move on to a supervised variant, 
the kernel Fisher discriminant (Chapter 15), and describe a kernel algorithm for mod- 
elling unlabelled data using nonlinear regularized principal manifolds (Chapter 17). 
We also discuss Bayesian variants and insights into kernel algorithms (Chapter 16). 
Finally, we describe some rather useful techniques for reducing the complexity
of the kernel expansion returned by the above algorithms, and by SVMs
(Chapter 18). It turns out that these techniques also have the potential to develop into
other algorithms, in areas such as denoising and nonlinear modelling of data.


1. From Nachgelesene Gedichte 1845 - 1856.


13 Designing Kernels


Overview


In general, the choice of a kernel corresponds to 


• choosing a similarity measure for the data: Section 1.1 introduces kernels as a
mathematical formalization of a notion of similarity;


• choosing a linear representation of the data: given a kernel, Section 2.2
constructs a linear space endowed with a dot product that corresponds to the kernel;


• choosing a function space for learning: the representer theorem (Section 4.2)
states that the solutions of a large class of kernel algorithms, among them SVMs,
are precisely given by kernel expansions; thus, the kernel determines the
functional form of all possible solutions;


• choosing a regularization functional: given a kernel, Theorem 4.9
characterizes a regularization term with the property that an SVM using that kernel can be
thought of as penalizing that term; for instance, Gaussian kernels penalize
derivatives of all orders, and thus enforce smoothness of the solution;


• choosing a covariance function for correlated observations; in other words,
encoding prior knowledge about how observations at different points of the input
domain relate to each other (this interpretation is explained in Section 16.3);


• choosing a prior over the set of functions: as explained in Sections 4.10 and
16.3, each kernel induces a distribution encoding how likely different functions
are considered a priori.


Therefore, the choice of the kernel should reflect prior knowledge about the prob- 
lem at hand. Specifically, the kernel is the prior knowledge we have about a prob- 
lem and its solution. Accordingly, just as there is no “free lunch” in learning (Sec- 
tion 5.1), there is also no free lunch in kernel choice. 

This chapter gathers a number of methods and results concerning the design of 
kernel functions. We start with a somewhat anecdotal collection of general meth- 
ods for the construction of kernels that are positive definite (Section 13.1). We then 
consider some interesting classes of kernels engineered for particular tasks. Specif- 
ically, we look at string kernels for the processing of sequence data (Section 13.2), 
and locality-improved kernels that take into account local structure in data, such 
as spatial vicinity in images (Section 13.3). Following this, we summarize the key
features of a class of kernels that take into account underlying probabilistic
models, and can thus be thought of as defining a similarity measure which respects the
process that generated the patterns (Section 13.4).

The chapter strongly relies on background from Chapter 2; in particular, on
Sections 2.1 through 2.3. Some background from Section 4.4 is useful to understand
the Fourier space representation of translation invariant kernels. The material
in Section 13.3 requires basic knowledge of the SVM classification algorithm, as
described in Section 1.5 and Chapter 7. Finally, the section on natural kernels
builds on concepts introduced in Section 3.3.


13.1 Tricks for Constructing Kernels 


Linear 
Combination 
of Kernels 


Product of 
Kernels 


Conformal 
Transformation 


We now gather a number of results useful for designing positive definite kernels 
(referred to as kernels for brevity) [42, 121, 1, 85, 340, 480, 125]. Many of these tech- 
niques concern manipulations that preserve the positive definiteness of Defini- 
tion 2.5; which is to say, closure properties of the set of admissible kernels. 


Proposition 13.1 (Sums and Limits of Kernels) The set of kernels forms a convex
cone, closed under pointwise convergence. In other words,


• if $k_1$ and $k_2$ are kernels, and $\alpha_1, \alpha_2 \ge 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is a kernel;


• if $k_1, k_2, \ldots$ are kernels, and $k(x, x') := \lim_{n \to \infty} k_n(x, x')$ exists for all $x, x'$, then $k$ is a
kernel.


Whilst the above statements are fairly obvious, the next one is rather surprising 
and often useful for verifying that a given kernel is positive definite. 


Proposition 13.2 (Pointwise Products [483]) If $k_1$ and $k_2$ are kernels, then $k_1 k_2$,
defined by $(k_1 k_2)(x, x') := k_1(x, x') k_2(x, x')$, is a kernel.


Note that the corresponding result for positive definite matrices concerns the 
positivity of the matrix that is obtained by element-wise products of two positive 
definite matrices. The original proof of Schur [483] is somewhat technical. Instead, 
we briefly reproduce the proof of [402], which should be enlightening to readers 
with a basic knowledge of multivariate statistics. 


Proof Let $(V_1, \ldots, V_m)$ and $(W_1, \ldots, W_m)$ be two independent normally
distributed random vectors, with mean zero, and respective covariance matrices
$K_1$ and $K_2$. Then the matrix $K$, whose elements are the products of the
corresponding elements of $K_1$ and $K_2$, is the covariance matrix of the random vector
$(V_1 W_1, \ldots, V_m W_m)$, hence it is positive definite. ∎
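
Propositions 13.1 and 13.2 can be checked numerically on a finite sample. The following is a minimal sketch, assuming NumPy is available; the sample points, the two base kernels, and the helper names `gram` and `min_eig` are illustrative choices, not part of the text. It builds two Gram matrices and confirms that a nonnegative combination and the element-wise product have no negative eigenvalues beyond rounding error.

```python
import numpy as np

def gram(kernel, xs):
    """Gram matrix K_ij = kernel(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(K).min())

rng = np.random.default_rng(0)
xs = rng.normal(size=(8, 3))            # 8 hypothetical points in R^3

k1 = lambda a, b: float(a @ b)                              # linear kernel
k2 = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))      # Gaussian kernel

K1, K2 = gram(k1, xs), gram(k2, xs)
K_sum = 0.5 * K1 + 2.0 * K2    # nonnegative combination (Proposition 13.1)
K_prod = K1 * K2               # element-wise product (Proposition 13.2)

print(min_eig(K_sum), min_eig(K_prod))
```

A finite check of this kind cannot prove positive definiteness, but a clearly negative eigenvalue would disprove it for a candidate kernel.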


A special case of the above are conformal transformations [10], 


$$k_f(x, x') = f(x)\, k(x, x')\, f(x'), \tag{13.1}$$

obtained by multiplying a kernel $k$ with a rank-one kernel (cf. Problem 2.15)
$k'(x, x') = f(x) f(x')$, where $f$ is a positive function. Since

$$\cos(\angle(\Phi_f(x), \Phi_f(x'))) = \frac{f(x)\, k(x, x')\, f(x')}{\sqrt{f(x)\, k(x, x)\, f(x)}\, \sqrt{f(x')\, k(x', x')\, f(x')}} = \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}} = \cos(\angle(\Phi(x), \Phi(x'))),$$

this transformation does not affect angles in the feature space.
For kernels of the dot product type, written $k(x, x') = k(\langle x, x' \rangle)$, the following
conditions apply.


Theorem 13.3 (Necessary Conditions for Dot Product Kernels [86]) A differentiable
function of the dot product $k(x, x') = k(\langle x, x' \rangle)$ has to satisfy

$$k(t) \geq 0, \quad k'(t) \geq 0 \quad \text{and} \quad k'(t) + t\, k''(t) \geq 0 \tag{13.2}$$

for any $t \geq 0$, in order to be a positive definite kernel.

Note that the conditions in Theorem 13.3 are only necessary but not sufficient. The
general case is given by the following theorem (see also Section 4.6.1 for the finite
dimensional counterpart).


Theorem 13.4 (Power Series of Dot Product Kernels [466]) A function
$k(x, x') = k(\langle x, x' \rangle)$ defined on an infinite dimensional Hilbert space, with a power series
expansion

$$k(t) = \sum_{n=0}^{\infty} a_n t^n, \tag{13.3}$$

is a positive definite kernel if and only if for all $n$, we have $a_n \geq 0$.


A slightly weaker condition applies for finite dimensional spaces. For further 
details and examples see Section 4.6 and [42, 511]. 

We next state a condition for translation invariant kernels, meaning kernels of
the form $k(x, x') = k(x - x')$.


Theorem 13.5 (Fourier Criterion [516]) Suppose $\mathcal{X} \subseteq \mathbb{R}^N$. A translation invariant
function $k(x, x') = k(x - x')$ is a positive definite kernel if the Fourier transform

$$F[k](\omega) = (2\pi)^{-N/2} \int_{\mathcal{X}} e^{-i \langle \omega, x \rangle}\, k(x)\, dx \tag{13.4}$$

is nonnegative.


This theorem is proved, and further discussed, in Section 4.4. 
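
As a small illustration of the Fourier criterion, the sketch below compares two translation invariant functions on the real line; it assumes NumPy, and the sample points and helper name `gram_1d` are chosen only for this example. The Gaussian has a nonnegative (Gaussian) Fourier transform and yields a positive definite Gram matrix. The box function has a transform proportional to $\sin(\omega)/\omega$, which changes sign, so the criterion does not apply, and on these three points its Gram matrix indeed has a negative eigenvalue.

```python
import numpy as np

def gram_1d(k, xs):
    """Gram matrix of a translation invariant function k(x - x') on the real line."""
    d = xs[:, None] - xs[None, :]
    return k(d)

gaussian = lambda d: np.exp(-0.5 * d ** 2)          # Fourier transform is again a Gaussian
box = lambda d: (np.abs(d) <= 1.0).astype(float)    # Fourier transform changes sign

xs = np.array([0.0, 0.9, 1.8])          # illustrative sample points
eig_gauss = np.linalg.eigvalsh(gram_1d(gaussian, xs))
eig_box = np.linalg.eigvalsh(gram_1d(box, xs))

print(eig_gauss.min(), eig_box.min())
```

For the box function the Gram matrix is [[1, 1, 0], [1, 1, 1], [0, 1, 1]], whose smallest eigenvalue is $1 - \sqrt{2}$.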

As discussed in Chapter 2, we typically do not worry about the exact map $\Phi$
to which the kernel corresponds, once we have a suitable kernel. For the purpose
of constructing kernels, however, it can be useful to compute the kernels from
mappings into some dot product space $\mathcal{H}$, $\Phi : \mathcal{X} \to \mathcal{H}$. Ideally, we would like
to choose $\Phi$ such that we can obtain an expression for $\langle \Phi(x), \Phi(x') \rangle$ that can be
computed efficiently. Consider now mappings into function spaces,

$$x \mapsto f_x, \tag{13.5}$$

with $f_x$ being a real-valued function. We further assume that these spaces are
equipped with a dot product,

$$\langle f_x, f_{x'} \rangle = \int f_x(u)\, f_{x'}(u)\, du. \tag{13.6}$$

We can then define kernels of the form

$$k(x, x') := \langle f_x, f_{x'} \rangle^d, \tag{13.7}$$

for instance. As an example, suppose the input patterns $x_i$ are $q \times q$ images. Then
we can map these patterns to two-dimensional image intensity distributions $f_{x_i}$
(for instance, splines on $[0,1]^2$). The dot product between the $f_{x_i}$ then approximately
equals the original dot product between the images represented as pixel
vectors, which can be seen by considering the finite sum approximation to the
integral,

$$\int_0^1 \int_0^1 f_x(u)\, f_{x'}(u)\, d^2u \approx \frac{1}{q^2} \sum_{i=1}^{q} \sum_{j=1}^{q} f_x\!\left(\frac{i - \frac{1}{2}}{q}, \frac{j - \frac{1}{2}}{q}\right) f_{x'}\!\left(\frac{i - \frac{1}{2}}{q}, \frac{j - \frac{1}{2}}{q}\right).$$

Note that in the function representation, it is possible, for instance, to define
kernels that can compare images in different resolutions.
Given a function $k(x, x')$, we can construct iterated kernels (e.g., [112]) using

$$k^{(2)}(x, x') := \int k(x, x'')\, k(x', x'')\, dx''. \tag{13.8}$$

Note that $k^{(2)}$ is a positive definite kernel even if $k$ is not, as can be seen by verifying
the condition of Mercer's theorem,

$$\int\!\!\int k^{(2)}(x, x')\, f(x)\, f(x')\, dx\, dx' = \int\!\!\int\!\!\int k(x, x'')\, k(x', x'')\, f(x)\, f(x')\, dx''\, dx\, dx' = \int \left( \int k(x, x'')\, f(x)\, dx \right)^2 dx''. \tag{13.9}$$


A similar construction can be accomplished in the discrete case, cf. Problem 13.3. 
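
For intuition, here is a sketch of a discrete analogue, in which the integral in (13.8) is replaced by a sum over a finite set of points, so that the iterated kernel becomes a matrix product. This discretization and the example matrix are assumptions made for illustration; they are not taken from Problem 13.3. The matrix `K` is symmetric with zero trace, hence indefinite, while `K2` is positive definite.

```python
import numpy as np

# K[i, l] plays the role of k(x_i, x_l) on three points. K is symmetric with
# zero trace and nonzero entries, so it has a negative eigenvalue.
K = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, -1.0],
              [2.0, -1.0, 0.0]])

# Discrete analogue of (13.8): K2[i, j] = sum_l K[i, l] * K[j, l]
K2 = K @ K.T

eig_K = np.linalg.eigvalsh(K)
eig_K2 = np.linalg.eigvalsh(K2)
print(eig_K.min(), eig_K2.min())
```

The quadratic form of `K2` is a sum of squares, mirroring the last step of (13.9).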

According to Proposition 13.2, the product of kernels is also a kernel. Let us now 
consider a different form of product, the tensor product, which also works if the two 
kernels are defined on different domains. 


Proposition 13.6 (Tensor Products) If $k_1$ and $k_2$ are kernels defined respectively on
$\mathcal{X}_1 \times \mathcal{X}_1$ and $\mathcal{X}_2 \times \mathcal{X}_2$, then their tensor product,

$$(k_1 \otimes k_2)(x_1, x_2, x_1', x_2') = k_1(x_1, x_1')\, k_2(x_2, x_2'), \tag{13.10}$$

is a kernel on $(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$. Here, $x_1, x_1' \in \mathcal{X}_1$ and $x_2, x_2' \in \mathcal{X}_2$.

This result follows from the fact that the (usual) product of kernels is a kernel (see
Proposition 13.2 and Problem 13.4).

There is a corresponding generalization from the sum of kernels to their direct 
sum [234, 480]. 


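Proposition 13.7 (Direct Sums) If $k_1$ and $k_2$ are kernels defined respectively on $\mathcal{X}_1 \times \mathcal{X}_1$
and $\mathcal{X}_2 \times \mathcal{X}_2$, then their direct sum,

$$(k_1 \oplus k_2)(x_1, x_2, x_1', x_2') = k_1(x_1, x_1') + k_2(x_2, x_2'), \tag{13.11}$$

is a kernel on $(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$. Here, $x_1, x_1' \in \mathcal{X}_1$ and $x_2, x_2' \in \mathcal{X}_2$.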


This construction can be useful if the different parts of the input have different
meanings, and should be dealt with differently. In this case, we can split the inputs
into two parts $x_1$ and $x_2$, and use two different kernels for these parts [480].
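
The tensor product (13.10) and direct sum (13.11) translate directly into higher-order functions. The sketch below assumes NumPy; each input is represented as a pair whose parts live in different spaces, and the two part kernels (`k_lin`, `k_rbf`) and the random data are illustrative assumptions.

```python
import numpy as np

def k_lin(a, b):
    """Linear kernel on the first part."""
    return float(np.dot(a, b))

def k_rbf(a, b, gamma=1.0):
    """Gaussian kernel on the second part."""
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def tensor_product(k1, k2):
    """(k1 (x) k2)((x1, x2), (x1', x2')) = k1(x1, x1') * k2(x2, x2'), cf. (13.10)."""
    return lambda x, xp: k1(x[0], xp[0]) * k2(x[1], xp[1])

def direct_sum(k1, k2):
    """(k1 (+) k2)((x1, x2), (x1', x2')) = k1(x1, x1') + k2(x2, x2'), cf. (13.11)."""
    return lambda x, xp: k1(x[0], xp[0]) + k2(x[1], xp[1])

def gram(k, xs):
    return np.array([[k(a, b) for b in xs] for a in xs])

rng = np.random.default_rng(0)
# Each input is a pair: a part in R^2 and a part in R^3.
data = [(rng.normal(size=2), rng.normal(size=3)) for _ in range(7)]

K_tensor = gram(tensor_product(k_lin, k_rbf), data)
K_direct = gram(direct_sum(k_lin, k_rbf), data)
print(np.linalg.eigvalsh(K_tensor).min(), np.linalg.eigvalsh(K_direct).min())
```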

This observation naturally leads to the general problem of defining kernels on 
structured objects [234, 585]. Suppose the object $x \in \mathcal{X}$ is composed of $x_d \in \mathcal{X}_d$,
where $d = 1, \ldots, D$ (note that the sets $\mathcal{X}_d$ need not be equal). For instance, consider
the string $x$ = ATG, and $D = 2$. It is composed of the parts $x_1$ = AT and $x_2$ = G,
or alternatively, of $x_1$ = A and $x_2$ = TG. Mathematically speaking, the set of
"allowed" decompositions can be thought of as a relation $R(x_1, \ldots, x_D, x)$, to be
read as "$x_1, \ldots, x_D$ constitute the composite object $x$."

Haussler [234] investigated how to define a kernel between composite objects by
building on similarity measures that assess their respective parts; in other words,
kernels $k_d$ defined on $\mathcal{X}_d \times \mathcal{X}_d$. Define the R-convolution of $k_1, \ldots, k_D$ as

$$(k_1 \star \cdots \star k_D)(x, x') := \sum_{R} \prod_{d=1}^{D} k_d(x_d, x_d'), \tag{13.12}$$

where the sum runs over all possible ways (allowed by $R$) in which we can decompose
$x$ into $x_1, \ldots, x_D$ and $x'$ into $x_1', \ldots, x_D'$; that is, all $(x_1, \ldots, x_D, x_1', \ldots, x_D')$ such
that $R(x_1, \ldots, x_D, x)$ and $R(x_1', \ldots, x_D', x')$.$^1$ If there is only a finite number of ways,
the relation $R$ is called finite. In this case, it can be shown that the R-convolution
is a valid kernel [234].
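
To make (13.12) concrete, the following sketch evaluates an R-convolution for the string example with $D = 2$. The relation used here (all splits of a string into two non-empty contiguous parts) and the part kernels (exact match, and a constant kernel) are assumptions chosen for illustration only.

```python
def splits(x):
    """All ways to write x as x1 + x2 with both parts non-empty (the relation R, D = 2)."""
    return [(x[:i], x[i:]) for i in range(1, len(x))]

def delta(a, b):
    """Exact-match kernel on parts."""
    return 1.0 if a == b else 0.0

def one(a, b):
    """Constant kernel on parts."""
    return 1.0

def r_convolution(x, xp, part_kernels=(delta, delta)):
    """Sum over all pairs of decompositions of the product of part kernels, cf. (13.12)."""
    total = 0.0
    for parts in splits(x):
        for parts_p in splits(xp):
            prod = 1.0
            for k_d, a, b in zip(part_kernels, parts, parts_p):
                prod *= k_d(a, b)
            total += prod
    return total

print(r_convolution("ATG", "ATG"))                  # both decompositions match
print(r_convolution("ATG", "ATC", (delta, one)))    # only the first parts are compared
print(r_convolution("A", "ATG"))                    # "A" cannot be decomposed: empty sum
```

The last call returns zero, in line with the convention that an empty sum equals zero.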

Specific examples of convolution kernels are Gaussians (Problem 13.7) and 
ANOVA kernels [578, 88, 562, 529]. ANOVA stands for analysis of variance, and 
denotes a statistical technique to analyze interactions between attributes of the 
data. To construct an ANOVA kernel, we consider $\mathcal{X} = S^N$ for some set $S$, and
kernels $k^{(i)}$ on $S \times S$, where $i = 1, \ldots, N$. For $D = 1, \ldots, N$, the ANOVA kernel of order
$D$ is defined as

$$k_D(x, x') := \sum_{1 \leq i_1 < \cdots < i_D \leq N}\ \prod_{d=1}^{D} k^{(i_d)}(x_{i_d}, x'_{i_d}). \tag{13.13}$$


1. Note that we use the convention that an empty sum equals zero, hence if either $x$ or $x'$
cannot be decomposed, then $(k_1 \star \cdots \star k_D)(x, x') = 0$.

Note that if $D = N$, the sum consists only of the term for which $(i_1, \ldots, i_D) =
(1, \ldots, N)$, and $k$ equals the tensor product $k^{(1)} \otimes \cdots \otimes k^{(N)}$. At the other extreme, if
$D = 1$, then the products collapse to one factor each, and $k$ equals the direct sum
$k^{(1)} \oplus \cdots \oplus k^{(N)}$. For intermediate values of $D$, we get kernels that lie in between
tensor products and direct sums. It is also possible to use ANOVA kernels to
interpolate between (ordinary) products and sums of kernels, cf. Problem 13.6.
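
The sum in (13.13) has $\binom{N}{D}$ terms, so direct enumeration becomes expensive quickly. The sketch below evaluates the ANOVA kernel both by enumeration and by the standard recursion for elementary symmetric polynomials, which needs $O(ND)$ operations. This recursion is one possible scheme and is not claimed to be the procedure used in the cited references; the base kernel $1 + ab$ and the function names are illustrative assumptions.

```python
from itertools import combinations
from math import prod, isclose

def base_kernel(a, b):
    """Illustrative one-dimensional base kernel, used for every coordinate."""
    return 1.0 + a * b

def anova_brute_force(x, xp, D, base=base_kernel):
    """Direct evaluation of (13.13): sum over all index subsets of size D."""
    vals = [base(a, b) for a, b in zip(x, xp)]
    return sum(prod(vals[i] for i in idx) for idx in combinations(range(len(vals)), D))

def anova_recursive(x, xp, D, base=base_kernel):
    """Same value via e[d] <- e[d] + v * e[d-1], processing one coordinate at a time."""
    e = [1.0] + [0.0] * D
    for a, b in zip(x, xp):
        v = base(a, b)
        for d in range(D, 0, -1):    # descending, so each coordinate enters a term at most once
            e[d] += v * e[d - 1]
    return e[D]

x = [0.5, 1.0, 2.0, 0.1]
xp = [1.5, 0.2, 0.7, 3.0]
for D in range(1, 5):
    print(D, anova_brute_force(x, xp, D), anova_recursive(x, xp, D))
```

For $D = 1$ both functions return the sum of the coordinate-wise kernel values, and for $D = N$ their product, matching the two extremes described above.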

ANOVA kernels typically use some moderate value of $D$, which specifies the
order of the interactions between attributes $x_{i_d}$ that we are interested in. The sum
then runs over the numerous terms that take into account interactions of order
$D$; fortunately, the computational cost can be reduced by utilizing recurrent
procedures for the kernel evaluation [88, 529]. ANOVA kernels have been shown to
work rather well in multi-dimensional SV regression problems (cf. Chapter 9 and
[529]). In this case, the inputs were $N$-dimensional vectors, and all $k^{(n)}$ were chosen
identically as one-dimensional linear spline kernels with an infinite number of
nodes: for $x, x' \in \mathbb{R}_0^+$,

$$k^{(n)}(x, x') = 1 + x x' + \frac{\min(x, x')^2\, |x - x'|}{2} + \frac{\min(x, x')^3}{3}, \quad n = 1, \ldots, N. \tag{13.14}$$

Note that it is advisable to use a kernel for $k^{(n)}$ which never or rarely takes the value
zero, since a single zero term would eliminate the product in (13.13). Finally, it is
possible to prove that ANOVA kernels are special cases of R-convolutions ([234],
cf. Problem 13.8).


One way in which SVMs have been used for text categorization [265] is the bag-of-
words representation, as briefly mentioned in Chapter 7. This maps a given text to
a sparse vector, where each component corresponds to a word, and a component 
is set to one (or some other number) whenever the related word occurs in the text. 
Using an efficient sparse representation, the dot product between two such vectors 
can be computed quickly. Furthermore, this dot product is by construction a valid 
kernel, referred to as a sparse vector kernel. One of its shortcomings, however, is 
that it does not take into account the word ordering of a document. Other sparse 
vector kernels are also conceivable, such as one that maps a text to the set of pairs 
of words that are in the same sentence [265, 585], or those which look only at pairs 
of words within a certain vicinity with respect to each other [495]. 
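
A minimal sketch of the bag-of-words kernel follows, using only the Python standard library. The whitespace tokenizer and the use of word counts as component values are simplifying assumptions; the text only requires that a component is nonzero when the word occurs.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to a sparse vector of word counts (hypothetical whitespace tokenizer)."""
    return Counter(text.lower().split())

def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as dicts; iterates over the smaller one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(c * v[w] for w, c in u.items() if w in v)

a = bag_of_words("the cat sat on the mat")
b = bag_of_words("the mat sat on the cat")
c = bag_of_words("the dog barked")

print(sparse_dot(a, b), sparse_dot(a, a), sparse_dot(a, c))
```

Texts `a` and `b` contain the same words in a different order and therefore receive the same kernel value as `a` with itself, which illustrates the shortcoming regarding word ordering mentioned above.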

A more sophisticated way of dealing with string data was recently proposed 
[585, 234]. The basic idea is as described above for general structured objects 
(13.12): Compare the strings by means of the substrings they contain. The more 
substrings two strings have in common, the more similar they are. The substrings 
need not always be contiguous; that said, the further apart the first and last 
element of a substring are, the less weight should be given to the similarity. 

Remarkably, it is possible to define an efficient kernel which computes the
dot product in the feature space spanned by all substrings of documents, with
a computational complexity that is linear in the lengths of the documents being 
compared, and the length of the substrings. We now describe this kernel, following 
[333]. 

Consider a finite alphabet $\Sigma$, the set of all strings of length $n$, $\Sigma^n$, and the set of
all finite strings,

$$\Sigma^* := \bigcup_{n=0}^{\infty} \Sigma^n. \tag{13.15}$$

The length of a string $s \in \Sigma^*$ is denoted by $|s|$, and its elements by $s(1) \ldots s(|s|)$;
the concatenation of $s$ and $t \in \Sigma^*$ is written $st$. Let us now form subsequences $u$ of
strings. Given an index sequence $\mathbf{i} := (i_1, \ldots, i_{|u|})$ with $1 \leq i_1 < \cdots < i_{|u|} \leq |s|$, we
define $u := s(\mathbf{i}) := s(i_1) \ldots s(i_{|u|})$. We call $l(\mathbf{i}) := i_{|u|} - i_1 + 1$ the length of the subsequence
in $s$. Note that if $\mathbf{i}$ is not contiguous, then $l(\mathbf{i})$ is larger than $|u|$.

The feature space built from strings of length $n$ is defined to be $\mathcal{H}_n := \mathbb{R}^{(\Sigma^n)}$. This
notation means that the space has one dimension (or coordinate) for each element
of $\Sigma^n$, labelled by that element (equivalently, we can think of it as the space of all
real-valued functions on $\Sigma^n$). We can thus describe the feature map coordinate-wise
for each $u \in \Sigma^n$ via

$$[\Phi_n(s)]_u := \sum_{\mathbf{i} : s(\mathbf{i}) = u} \lambda^{l(\mathbf{i})}. \tag{13.16}$$

Here, $0 < \lambda \leq 1$ is a decay parameter: The larger the length of the subsequence
in $s$, the smaller the respective contribution to $[\Phi_n(s)]_u$. The sum runs over all
subsequences of $s$ which equal $u$.

For instance, consider a dimension of $\mathcal{H}_3$ spanned (that is, labelled) by the string
asd. In this case, we have $[\Phi_3(\text{Nasdaq})]_{\text{asd}} = \lambda^3$, while $[\Phi_3(\text{lass das})]_{\text{asd}} = 2\lambda^5$.$^2$
The kernel induced by the map $\Phi_n$ takes the form

$$k_n(s, t) = \sum_{u \in \Sigma^n} [\Phi_n(s)]_u\, [\Phi_n(t)]_u = \sum_{u \in \Sigma^n}\ \sum_{(\mathbf{i}, \mathbf{j}) : s(\mathbf{i}) = t(\mathbf{j}) = u} \lambda^{l(\mathbf{i})}\, \lambda^{l(\mathbf{j})}. \tag{13.17}$$

To take into account strings of different lengths $n$, we can use linear combinations,

$$k := \sum_{n} c_n k_n, \tag{13.18}$$

with $c_n \geq 0$. Let us denote the corresponding feature map by $\Phi$. Clearly, the number
and size of the terms in the above sum strongly depend on the lengths of $s$ and
$t$. Normalization of the feature map, using $\Phi(t) / \|\Phi(t)\|$, is therefore recommended
[333]. For the kernel, this implies that we should use $k(s, t) / \sqrt{k(s, s)\, k(t, t)}$.
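
The definitions (13.16) and (13.17) can be evaluated by brute force for short strings, which is useful for checking intuition and for testing faster implementations. The sketch below enumerates all index sequences with `itertools.combinations`; its cost grows combinatorially, so it is meant for illustration only, and the function names are hypothetical.

```python
from itertools import combinations

def feature(s, u, lam):
    """[Phi_n(s)]_u from (13.16): sum of lam**l(i) over index sequences i with s(i) = u."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            total += lam ** (idx[-1] - idx[0] + 1)    # l(i) = i_|u| - i_1 + 1
    return total

def k_n(s, t, n, lam):
    """k_n(s, t) from (13.17); only the u occurring in s can contribute."""
    subs = {"".join(s[i] for i in idx) for idx in combinations(range(len(s)), n)}
    return sum(feature(s, u, lam) * feature(t, u, lam) for u in subs)

def k_normalized(s, t, n, lam):
    """Normalized kernel k(s, t) / sqrt(k(s, s) k(t, t))."""
    return k_n(s, t, n, lam) / (k_n(s, s, n, lam) * k_n(t, t, n, lam)) ** 0.5

lam = 0.5
print(feature("Nasdaq", "asd", lam))      # lam**3
print(feature("lass das", "asd", lam))    # 2 * lam**5
print(k_n("cat", "car", 2, lam))          # only "ca" is shared: lam**4
```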


2. In the first string, asd is a contiguous substring. In the second string, it appears twice as
a non-contiguous substring of length 5: in lass das, the two occurrences use the characters
at positions (2, 3, 6) and (2, 4, 6), respectively.




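To describe the actual computation of $k_n$, define

$$k'_i(s, t) := \sum_{u \in \Sigma^i}\ \sum_{(\mathbf{i}, \mathbf{j}) : s(\mathbf{i}) = t(\mathbf{j}) = u} \lambda^{|s| + |t| - i_1 - j_1 + 2}, \quad \text{for } i = 1, \ldots, n - 1. \tag{13.19}$$

Using $x \in \Sigma^1$, we then have the following recursions, which allow the computation
of $k_n(s, t)$ for all $n = 1, 2, \ldots$ (note that the kernels are symmetric):

$$\begin{aligned}
k'_0(s, t) &= 1 \quad \text{for all } s, t \\
k'_i(s, t) &= 0 \quad \text{if } \min(|s|, |t|) < i \\
k_i(s, t) &= 0 \quad \text{if } \min(|s|, |t|) < i \\
k'_i(sx, t) &= \lambda\, k'_i(s, t) + \sum_{j : t_j = x} k'_{i-1}(s, t[1, \ldots, j-1])\, \lambda^{|t| - j + 2}, \quad i = 1, \ldots, n - 1 \\
k_n(sx, t) &= k_n(s, t) + \sum_{j : t_j = x} k'_{n-1}(s, t[1, \ldots, j-1])\, \lambda^2
\end{aligned} \tag{13.20}$$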
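
The recursions (13.20) can be sketched as a memoized function over prefix lengths. This is a straightforward transcription under stated assumptions, not the optimized algorithm of the cited references: because of the inner loop over $j$, it costs roughly $O(n\,|s|\,|t|^2)$ operations. The helper `brute_force`, which evaluates (13.17) directly, is included only to check the result on short strings.

```python
from functools import lru_cache
from itertools import combinations

def subsequence_kernel(s, t, n, lam):
    """k_n(s, t) via the recursions (13.20); prefixes are addressed by their lengths a, b."""

    @lru_cache(maxsize=None)
    def k_prime(i, a, b):
        # k'_i(s[:a], t[:b])
        if i == 0:
            return 1.0
        if min(a, b) < i:
            return 0.0
        x = s[a - 1]                          # s[:a] = s[:a-1] followed by x
        total = lam * k_prime(i, a - 1, b)
        for j in range(1, b + 1):             # 1-based positions j with t_j = x
            if t[j - 1] == x:
                total += k_prime(i - 1, a - 1, j - 1) * lam ** (b - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(a, b):
        # k_n(s[:a], t[:b])
        if min(a, b) < n:
            return 0.0
        x = s[a - 1]
        total = k(a - 1, b)
        for j in range(1, b + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, a - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))

def brute_force(s, t, n, lam):
    """Direct evaluation of (13.17), for checking only."""
    def occ(w):
        return [(idx[-1] - idx[0] + 1, "".join(w[i] for i in idx))
                for idx in combinations(range(len(w)), n)]
    return sum(lam ** (ls + lt) for ls, u in occ(s) for lt, v in occ(t) if u == v)

print(subsequence_kernel("cat", "car", 2, 0.5), brute_force("cat", "car", 2, 0.5))
```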


For further detail, see [585, 333, 156].

Local Image
Kernel 


13.3.1 Image Processing 


As described in Chapter 2, using a kernel $k(x, x') = \langle x, x' \rangle^d$ in an SVM classifier
implicitly leads to a decision boundary in the space of all possible products of
$d$ pixels. Using all such products, however, may not be desirable, since in real-world
images, correlations over short distances are much more reliable features
than are long-range correlations. To take this into account, we give the following
procedural definition of the kernel $k_p^{d_1, d_2}$ (cf. Figure 13.1 and [478]):

1. Compute a third image $(x .\!* x')$, defined as the pixel-wise product of $x$ and $x'$.

2. Sample $(x .\!* x')$ with pyramidal receptive fields of diameter $p$, centered at all
locations $(i, j)$, to obtain the values

$$z_{ij} := \sum_{i', j'} w(\max(|i - i'|, |j - j'|))\, (x .\!* x')_{i'j'}. \tag{13.21}$$

A possible choice for the weighting function $w : \mathbb{N}_0 \to \mathbb{R}$ is $w(n) = \max(q - n, 0)$,
where $q \in \mathbb{N}$. In this case, $p = 2q + 1$ is the width of the pyramidal receptive field.

3. Raise each $z_{ij}$ to the power $d_1$, to take into account local correlations within the
range of the pyramid.

4. Sum $z_{ij}^{d_1}$ over the whole image, and raise the result to the power $d_2$ to allow for
long-range correlations of order $d_2$.

The resulting kernel is of order $d_1 \cdot d_2$; however, it does not contain all possible
correlations of $d_1 \cdot d_2$ pixels unless $d_1 = 1$. In the latter case, we recover the standard
complete polynomial kernel of degree $d_2$.

Figure 13.1 Kernel utilizing local correlations in images. To compute $k(x, x')$ for two
images $x$ and $x'$, we sum over products between corresponding pixels of the two images,
in localized regions (in the figure, this is denoted by dot products $\langle \cdot, \cdot \rangle$), weighted by
pyramidal receptive fields. A first nonlinearity in form of an exponent $d_1$ is applied to the
outputs. The resulting numbers for all patches (only two are displayed) are summed, and
the $d_2$-th power of the result is taken as the value $k(x, x')$. This kernel corresponds to a dot
product in a polynomial space which is spanned mainly by localized correlations between
pixels (see Section 13.3).
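
The four steps of the procedural definition above can be sketched in a few lines of NumPy. The treatment of pixels outside the image (zero padding) is an assumption, since the text does not specify boundary handling, and the function names are hypothetical. The explicit double loop keeps the correspondence with (13.21) visible; a convolution routine would be the usual choice for larger images.

```python
import numpy as np

def pyramid_weights(q):
    """(2q+1) x (2q+1) array with entries w(max(|di|, |dj|)), where w(n) = max(q - n, 0)."""
    offsets = np.arange(-q, q + 1)
    dist = np.maximum(np.abs(offsets)[:, None], np.abs(offsets)[None, :])
    return np.maximum(q - dist, 0).astype(float)

def local_kernel(x, xp, q, d1, d2):
    """Steps 1-4; pixels outside the image are treated as zero (an assumption)."""
    prod = np.asarray(x, dtype=float) * np.asarray(xp, dtype=float)   # step 1
    w = pyramid_weights(q)
    padded = np.pad(prod, q)                      # zero padding at the border
    rows, cols = prod.shape
    z = np.empty_like(prod)
    for i in range(rows):                         # step 2: weighted local sums z_ij
        for j in range(cols):
            z[i, j] = np.sum(w * padded[i:i + 2 * q + 1, j:j + 2 * q + 1])
    return float(np.sum(z ** d1) ** d2)           # steps 3 and 4

rng = np.random.default_rng(0)
x, xp = rng.random((5, 5)), rng.random((5, 5))    # two hypothetical 5 x 5 images
print(local_kernel(x, xp, q=2, d1=2, d2=2))
```

With $q = 1$ the only nonzero weight is the central one, so $z_{ij}$ is just the pixel-wise product, and $d_1 = 1$ then gives $\langle x, x' \rangle^{d_2}$, the polynomial kernel mentioned in the text.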


Table 13.1 Summary: Error rates on the small MNIST database (Section A.1), for various
methods of incorporating prior knowledge. In all cases, degree 4 polynomial kernels were
used, either of the local type, or (by default) of the complete polynomial type.

Classifier                                                          Test error (in %)
SV (complete polynomial kernel of degree 4)                         4.0
semi-local kernel $k_p^{d_1, d_2}$, $d_1 = d_2 = 2$ (Section 13.3.1)            3.1
purely local kernel $k_p^{d_1, d_2}$, $d_1 = 4$, $d_2 = 1$ (Section 13.3.1)     3.4
Virtual SV (Section 11.3), with translations                        2.8
Virtual SV with $k_p^{d_1, d_2}$                                           2.0


We now report experimental results. As in Section 11.4, we used the small
MNIST database (Section A.1). As a reference, we employ the degree 4 polynomial
SVM, performing at 4.0% error (Table 11.4). To exploit locality in images, we used
the pyramidal receptive field kernel $k_p^{d_1, d_2}$ with diameter $p = 9$ and $d_1 \cdot d_2 = 4$;
these are degree 4 polynomial kernels that do not use all products of 4 pixels.
For $d_1 = d_2 = 2$, we observe an improved error rate of 3.1%; a different degree 4
kernel with only local correlations ($d_1 = 4$, $d_2 = 1$) leads to an error rate of 3.4%
(Table 13.1, [467]).

Although better than the 4.0% error rate for the degree 4 homogeneous polyno- 
mial, this is still worse than the Virtual SV result: Using image translations to gen- 
erate a set of Virtual SVs leads to an error rate of 2.8%. As the two methods exploit 
different types of prior knowledge, however, we expect that combining them will 
lead to still better performance; and indeed, this yields the best performance of all 
(2.0%), halving the error rate of the original system. Similar results were obtained
on the USPS database, where the combination of VSVs and locality-improved ker-
nels led to the best result on the original version of the dataset (see Table 7.4 and
[478]). Similarly good results have been reported for other applications, such as 
object recognition [467, 76] and texture classification [25]. 


13.3.2 DNA Start Codon Recognition 


To illustrate the universality of the above methods, let us now look at an applica- 
tion in a rather different domain, following [617, 375]. Genomic sequences contain 
untranslated regions, and so-called coding sequences (CDS) which encode pro- 
teins. In order to extract protein sequences from nucleotide sequences, it is a cen- 
tral problem in computational biology to recognize the translation initiation sites 
(TIS), from which coding starts to determine which parts of a sequence will be 
translated. 

Coding sequences can be characterized by alignment methods using homolo- 
gous proteins (e.g., [405]), or intrinsic properties of the nucleotide sequence which 
are learned for instance with Hidden Markov models (HMMs, e.g., [256]). A dif- 
ferent approach, which has turned out to be more successful, is to model the task 
of finding TIS as a classification problem (see [406, 617]). Out of the four letter 
DNA alphabet {A,C,G, T}, a potential start codon is typically an ATG triplet. The 
classification task is therefore to decide whether a symmetrical sequence window 
around the ATG indicates a true TIS, or a so-called pseudo site. Each nucleotide
in the window is represented using a sparse four-number encoding scheme. For 
known nucleotides {A,C, G, T}, the corresponding entry is set to 1; unknown nu- 
cleotides are represented by distributions over the four entries, determined by 
their respective frequencies in the sequences. The SVM gets a training set con- 
sisting of an input of encoded nucleotides in windows of length 200 around the 
ATG, together with a label indicating true/false TIS. In the TIS recognition task, it 
turns out to be rather useful to include biological knowledge, by engineering an 
appropriate kernel function. We now give three examples of kernel modifications 
that are particularly useful for start codon recognition. 

While certain local correlations are typical for TIS, dependencies between dis- 
tant positions are of minor importance, or do not even exist. Just as in the image 
processing task described in Section 13.3.1, we want the feature space to reflect 
this. We therefore modify the kernel as follows: At each sequence position, we
compare the two sequences locally, within a small window of length $2l + 1$ around
that position. We sum matching nucleotides, multiplied by weights $v_j$, which
increase from the boundaries to the center of the window. The resulting weighted
counts are raised to the $d_1$-th power. As above, $d_1$ reflects the order of local
correlations (within the window) that we expect to be of importance;

$$\mathrm{win}_p(x, x') = \left( \sum_{j=-l}^{+l} v_j\, \mathrm{match}_{p+j}(x, x') \right)^{d_1}. \tag{13.22}$$

Table 13.2 Comparison of start codon classification errors (measured on the test set)
achieved with different learning algorithms. All results are averages over the six data
partitions. SVMs were trained on 8000 data points. An optimal set of parameters was
selected according to the overall error on the remaining training data (= 3300 points);
only these results are presented. Note that the windows consist of $2l + 1$ nucleotides. The
neural net results were achieved by Pedersen and Nielsen ([406], personal communication). 
In the latter study, model selection seems to have involved test data, which might lead 
to slightly over-optimistic performance estimates. Positional conditional preference scores 
were calculated in a manner analogous to Salzberg [456], but extended to the same amount 
of input data supplied to the other methods. Note that all performance measures shown 
depend on the value of the classification function threshold. For SVMs, the thresholds 
are by-products of the training process; for the Salzberg method, ‘natural’ thresholds are 
derived from prior probabilities by Bayesian reasoning. Error rates denote the ratio of false 
predictions to total predictions. 


algorithm kernel parameters 


neural network 
Salzberg method 


SVM, simple polynomial 
SVM, locality-improved kernel 
SVM, codon-improved kernel 
SVM, Salzberg kernel 


Here, $\mathrm{match}_{p+j}(x, x')$ is defined to be 1 for matching nucleotides at position $p + j$,
and 0 otherwise. The window scores computed with $\mathrm{win}_p$ are summed over the
whole length of the sequence. Correlations between windows of order up to $d_2$ are
taken into account by raising the resulting sum to the power of $d_2$; that is,

$$k(x, x') = \left( \sum_{p=1}^{l} \mathrm{win}_p(x, x') \right)^{d_2}. \tag{13.23}$$

We call this kernel locality-improved, as it emphasizes local correlations. 
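
A minimal sketch of (13.22) and (13.23) on raw nucleotide strings follows. Several details are assumptions made for illustration: the default weights $v_j = l + 1 - |j|$ (the text only requires weights that increase towards the window center), the skipping of window positions that fall outside the sequence, and the direct comparison of characters instead of the sparse four-number encoding.

```python
def locality_improved_kernel(x, xp, l, d1, d2, weights=None):
    """Evaluate (13.22) and (13.23) for two equally long sequences x and xp.

    weights[j + l] holds v_j for j = -l, ..., l. The default v_j = l + 1 - |j| and the
    skipping of positions outside the sequence are assumptions, not taken from the text.
    """
    if len(x) != len(xp):
        raise ValueError("sequences must have equal length")
    if weights is None:
        weights = [l + 1 - abs(j) for j in range(-l, l + 1)]
    total = 0.0
    for p in range(len(x)):                     # sum of window scores over all positions
        win = 0.0
        for j in range(-l, l + 1):
            pos = p + j
            if 0 <= pos < len(x) and x[pos] == xp[pos]:
                win += weights[j + l]           # v_j * match_{p+j}(x, x')
        total += win ** d1                      # window score win_p
    return total ** d2

a = "ACGTATGCA"
b = "ACGAATGCT"
print(locality_improved_kernel(a, b, l=0, d1=1, d2=2))
print(locality_improved_kernel(a, b, l=2, d1=2, d2=1))
```

With $l = 0$ and $d_1 = 1$ the kernel reduces to the number of matching positions raised to the power $d_2$, which gives a simple sanity check.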

In an attempt to further improve performance, we incorporate another form of 
biological knowledge into the kernel, this time concerning the codon-structure of 
the coding sequence. A codon is a triplet of adjacent nucleotides that codes for 
one amino acid. By definition, the difference between a true TIS and a pseudo site 
is that downstream of a TIS, there are CDS (which shows codon structure), while 
upstream there are not. CDS and non-coding sequences show statistically different 
compositions. It is likely that the SVM exploits this difference for classification. 

We also hope to improve the kernel by reflecting the fact that CDS shifted by 
three nucleotides still look like CDS. Therefore, we further modify the locality- 
improved kernel function to account for this translation-invariance. In addition to 
counting matching nucleotides on corresponding positions, we also count matches 
that are shifted by three positions. We call this kernel codon-improved. It can be 
shown to be an admissible kernel function by explicitly deriving the monomial 
features.

A third modification to the kernel function is obtained by the Salzberg method,
where we essentially represent each data point by a sequence of log odd scores, 
corresponding to two probabilities for each position: First, the likelihood that the 
observed nucleotide at that position derives from a true TIS; and second, the 
likelihood that the nucleotide occurs at the given position relative to any ATG 
triplet, either centered around true translation initiation sites, or around pseudo 
sites. We then proceed in a manner analogous to the locality-improved kernel, 
replacing the sparse representation by the sequence of these scores. As expected, 
this leads to a further improvement in classification performance. 

All three engineered kernel functions outperform both a neural net and the 
original Salzberg method, reducing the overall number of misclassifications by up 
to 25% compared with the neural network (see Table 13.2). These SVM results are 
encouraging, especially since they apply to a problem domain whose importance 
is increasing. Further successful applications of SVMs in bioinformatics have been 
reported for microarray gene expression data and other problems [81, 224, 190, 
372, 403, 584, 611]. 


13.4 Natural Kernels 


Generative model techniques such as Hidden Markov Models (HMMs), dynamic 
graphical models, or mixtures of experts, can provide a principled framework 
for dealing with missing and incomplete data, uncertainty, or variable length se- 
quences. On the other hand, discriminative models like SVMs and other kernel 
methods have become standard tools of applied machine learning, leading to 
record benchmark results in a variety of domains. A promising approach to com- 
bine the strengths of both methods, by designing kernels inspired by generative 
models, was made in the work of Jaakkola and Haussler [259, 258]. They pro- 
pose the use of a construction called the Fisher kernel, to give a “natural” similar- 
ity measure that takes into account an underlying probability distribution. Since 
defining a kernel function automatically implies assumptions about metric relations
between the examples, they argue that these relations should be defined directly
from a generative probability model $p(x|\theta)$, where $\theta$ are the parameters of
the model. Below, we follow [388].


13.4.1 Natural Kernels 


To define a class of kernels derived from generative models, we need to introduce 
some basic concepts of information geometry. Consider a family of generative
models $p(x|\theta)$ (in other words, density functions), smoothly parametrized by $\theta =
(\theta^1, \ldots, \theta^r)$. These models form a manifold (called the statistical manifold) in the
space of all probability density functions. The key idea introduced by [259] is to
exploit the geometric structure on this manifold to obtain an induced metric for
the training patterns $x_i$.

Rather than dealing with $p(x|\theta)$ directly, we use the log-likelihood instead;
$l(x, \theta) := \ln p(x|\theta)$. For convenience, we repeat a few concepts from Chapter 3, in
particular Section 3.3.2:


• The derivative map of $l(x, \theta)$ is called the score (cf. (3.27)) $V_\theta : \mathcal{X} \to \mathbb{R}^r$,

$$V_\theta(x) := (\partial_{\theta^1} l(x, \theta), \ldots, \partial_{\theta^r} l(x, \theta)) = \nabla_\theta l(x, \theta) = \nabla_\theta \ln p(x|\theta), \tag{13.24}$$

the coordinates of which are taken as a 'natural' basis of tangent vectors. For
example, if $p(x|\theta)$ is a normal distribution, one possible parametrization is $\theta =
(\mu, \sigma)$, where $\mu$ is the mean vector and $\sigma$ is the covariance matrix of the Gaussian.
The magnitude of the components of $V_\theta(x)$ specifies the extent to which a change
in a particular component of $\theta$ (thus, a particular model parameter) changes the
probability of generating the object $x$. The relationship of these components to
sufficient statistics is discussed in [257].


• Since the manifold of $\ln p(x|\theta)$ is Riemannian (e.g., [8]), there is a metric defined
on its tangent space (the space of the scores), with metric tensor given by the
inverse of the Fisher information matrix (cf. (3.28)),

$$I := E_p\!\left[ V_\theta(x)\, V_\theta(x)^\top \right], \quad \text{i.e.,} \quad I_{ij} = E_p\!\left[ \partial_{\theta^i} \ln p(x|\theta)\, \partial_{\theta^j} \ln p(x|\theta) \right]. \tag{13.25}$$

Here, $E_p$ denotes the expectation with respect to the density $p$.

This metric is called the Fisher information metric, and induces a 'natural' distance
in the manifold. As we will show below, it can be used to measure the difference
in the generative process between a pair of examples $x_i$ and $x_j$ via the score map
$V_\theta(x)$ and $I^{-1}$.


Definition 13.8 (Natural Kernel) Denote by $M$ a strictly positive definite matrix, to be
referred to subsequently as the natural matrix. The corresponding natural kernel is given
by

$$k_M^{\mathrm{nat}}(x, x') := V_\theta(x)^\top M^{-1} V_\theta(x') = \nabla_\theta \ln p(x|\theta)^\top M^{-1}\, \nabla_\theta \ln p(x'|\theta). \tag{13.26}$$

For $M = I$, we obtain the Fisher kernel; for $M = \mathbf{1}$ we obtain a kernel which we will call
the plain kernel.$^3$ The latter is often used for convenience if $I$ is too difficult to compute.$^4$
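
As a worked example of Definition 13.8, the sketch below evaluates the Fisher kernel and the plain kernel for a one-dimensional Gaussian model. The parametrization $\theta = (\mu, \sigma)$ with $\sigma$ the standard deviation, the closed-form score and Fisher information matrix for that parametrization, and all function names are assumptions of this example rather than part of the text.

```python
import numpy as np

def score(x, mu, sigma):
    """V_theta(x) for a 1-D Gaussian with theta = (mu, sigma), sigma the standard deviation."""
    return np.array([(x - mu) / sigma ** 2,
                     ((x - mu) ** 2 - sigma ** 2) / sigma ** 3])

def fisher_information(mu, sigma):
    """Closed-form Fisher information matrix of the same model."""
    return np.diag([1.0 / sigma ** 2, 2.0 / sigma ** 2])

def natural_kernel(x, xp, mu, sigma, M=None):
    """k_M^nat(x, x') = V(x)^T M^{-1} V(x'); M defaults to the Fisher information matrix."""
    if M is None:
        M = fisher_information(mu, sigma)
    return float(score(x, mu, sigma) @ np.linalg.solve(M, score(xp, mu, sigma)))

mu, sigma = 0.0, 2.0
k_fisher = natural_kernel(1.0, 3.0, mu, sigma)                 # M = I (Fisher kernel)
k_plain = natural_kernel(1.0, 3.0, mu, sigma, M=np.eye(2))     # M = 1 (plain kernel)
print(k_fisher, k_plain)
```

In practice the parameters would come from a density estimation step on the data; here they are fixed by hand to keep the example self-contained.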


In the next section, we give a regularization theoretic analysis of the class of natural
kernels, and in particular of $k_I^{\mathrm{nat}}$ and $k_{\mathbf{1}}^{\mathrm{nat}}$.


3. In [257], kernels of the form $k(x, x') = \exp(-\|V_\theta(x) - V_\theta(x')\|^2 / c)$ are also considered.

4. Strictly speaking, we should write $k^{\mathrm{nat}}_{M, p(x|\theta)}$ rather than $k^{\mathrm{nat}}_M$, since $k$ also depends on the
generative model, and on the parameter $\theta$ chosen by some other procedure such as density
estimation. In addition, note that rather than requiring $M$ to be strictly positive definite,
positive definiteness would be sufficient. We would then have to replace $M^{-1}$ by the pseudo-inverse,
however, and the subsequent reasoning would be more cumbersome.

13.4.2 The Natural Regularization Operator


Let us briefly recall Section 4.3. In SVMs, we minimize a regularized risk functional,
where the complexity term can be written as $\|w\|^2$ in feature space notation,
or as $\|\Upsilon f\|^2$ when considering the functions in input space. The connection
between kernels $k$ and regularization operators $\Upsilon$ is given by

$$k(x_i, x_j) = \langle (\Upsilon k)(x_i, \cdot), (\Upsilon k)(x_j, \cdot) \rangle. \tag{13.27}$$

This relation states that if $k$ is a Green's function of $\Upsilon^* \Upsilon$, minimizing $\|w\|^2$ in feature
space is equivalent to minimizing the regularization term $\|\Upsilon f\|^2$.

To analyze the properties of natural kernels $k_M^{\mathrm{nat}}$, we exploit this connection
between kernels and regularization operators by finding an associated operator
$\Upsilon_M^{\mathrm{nat}}$ such that (13.27) holds. To this end, we need to specify a dot product in (13.27).
Note that this is one aspect of the choice of the class of regularization operators
that we are looking at; in particular, we are choosing the dot product space
into which $\Upsilon$ maps. We opt for the dot product in the $L_2(p)$ space of real-valued
functions,

$$\langle f, g \rangle := \int f(x)\, g(x)\, p(x|\theta)\, dx, \tag{13.28}$$

since this leads to a simple form for the corresponding regularization operators.
Measures different from $p(x|\theta)\,dx$ are also possible, leading to different forms of $\Upsilon$.


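Proposition 13.9 (Regularization Operators for Natural Kernels) Given a strictly
positive definite matrix $M$, a generative model $p(x|\theta)$, and a corresponding natural kernel
$k_M^{\mathrm{nat}}(x, x')$, $\Upsilon_M^{\mathrm{nat}}$ is an equivalent regularization operator if it satisfies

$$M = \int \left[ \Upsilon_M^{\mathrm{nat}} \nabla_\theta \ln p(x|\theta) \right] \left[ \Upsilon_M^{\mathrm{nat}} \nabla_\theta \ln p(x|\theta) \right]^\top p(x|\theta)\, dx. \tag{13.29}$$

Proof Substituting (13.26) into (13.27) yields

$$k_M^{\mathrm{nat}}(x, x') = \nabla_\theta \ln p(x|\theta)^\top M^{-1}\, \nabla_\theta \ln p(x'|\theta) \tag{13.30}$$

$$= \left\langle \Upsilon_M^{\mathrm{nat}} k_M^{\mathrm{nat}}(x, \cdot),\ \Upsilon_M^{\mathrm{nat}} k_M^{\mathrm{nat}}(x', \cdot) \right\rangle \tag{13.31}$$

$$= \int \nabla_\theta \ln p(x|\theta)^\top M^{-1} \left[ \Upsilon_M^{\mathrm{nat}} \nabla_\theta \ln p(x''|\theta) \right] \times \left[ \Upsilon_M^{\mathrm{nat}} \nabla_\theta \ln p(x''|\theta) \right]^\top M^{-1}\, \nabla_\theta \ln p(x'|\theta)\, p(x''|\theta)\, dx''. \tag{13.32}$$

Note that $\Upsilon_M^{\mathrm{nat}}$ acts on $p$ as a function of $x''$ only; the terms in $x$ and $x'$ are not
affected, which is why we may collect them outside. Thus the necessary condition
(13.29) ensures that the right hand side in (13.31) equals (13.32), which completes
the proof. ∎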


Let us consider the two special cases proposed in [259]. 


Corollary 13.10 (Fisher Kernel) The Fisher Kernel ($M = I$) induced by a generative
probability model with density $p$ corresponds to a regularizer equal to the squared $L_2(p)$-norm
of the estimated function,

$$\|\Upsilon_I^{\mathrm{nat}} f\|^2 = \|f\|^2_{L_2(p)}. \tag{13.33}$$

This can be seen by substituting $\Upsilon_I^{\mathrm{nat}} = \mathbf{1}$ into the rhs of (13.29), which yields the
definition of the Fisher information matrix.

We now explicitly describe the behavior of this regularizer. The solution of SV
regression using the Fisher kernel has the form $f(x) = \sum_{i=1}^{m} \alpha_i k_I^{\mathrm{nat}}(x, x_i)$, where
the $x_i$ are the SVs, and $\alpha$ is the solution of the SV programming problem. By
substitution, we obtain

$$\|\Upsilon_I^{\mathrm{nat}} f\|^2 = \int |f(x)|^2\, p(x|\theta)\, dx = \int \left( \sum_{i} \alpha_i\, \nabla_\theta \ln p(x|\theta)^\top I^{-1}\, \nabla_\theta \ln p(x_i|\theta) \right)^2 p(x|\theta)\, dx. \tag{13.34}$$

To understand this term, first recall that what we actually minimize is the regularized
risk $R_{\mathrm{reg}}[f]$; in other words, the sum of (13.34) and the empirical risk given by
the normalized negative log likelihood. The regularization term (13.34) prevents
overfitting by favoring solutions with smaller $\nabla_\theta \ln p(x|\theta)$. Consequently, the regularizer
favors the solution which is more stable (flat). See [388] for further details.
Note, however, that the validity of this intuitive explanation is somewhat limited:
some effects can compensate each other, as the $\alpha_i$ come with different signs.

Finally, we remark that the regularization operator of the conformal transforma- 
tion of the Fisher kernel k?™ into \/p(x|4)./p(x'|@)k7""(x, x’) is the identity. 

In practice, M = 1 is often used [259]. In this case, Proposition 13.9 specializes to 
the following result. The proof is straightforward, and can be found in [388]. 


Corollary 13.11 (Plain Kernel) The regularization operator associated with the plain
kernel $k_{\mathbf{1}}^{\mathrm{nat}}$ is the gradient operator $\nabla_x$ in the case where $p(x|\theta)$ belongs to the exponential
family of densities; that is, $\ln p(x|\theta) = \langle \theta, x \rangle - \pi(x) + c_0$ with an arbitrary function $\pi(x)$
and a normalization constant $c_0$.

This means that the regularization term can be written as

$$\|\Upsilon_{\mathbf{1}}^{\mathrm{nat}} f\|^2 = \|\nabla_x f(x)\|^2_p = \int \|\nabla_x f(x)\|^2\, p(x|\theta)\, dx, \tag{13.35}$$

thus favoring smooth functions via flatness in the first derivative.

13.4.3 The Feature Map of Natural Kernels 


Recall Proposition 4.10, in which we constructed kernels from a discrete set of basis
functions via (4.23),

$$k(x, x') = \sum_n \frac{d_n}{\lambda_n} \psi_n(x) \psi_n(x'), \tag{13.36}$$

where $d_n \in \{0, 1\}$ for all $n$, and $\sum_n \frac{d_n}{\lambda_n}$ converges. Setting all $d_n = 1$ simply means
that we chose to keep the full space spanned by $\psi_n$. Knowledge of $\psi_n$ and $\lambda_n$
helps us to understand the regularization properties of the kernel. In particular,
such information tells us which functions are considered simpler than others and
how much emphasis is placed on the individual functions $\psi_n$. We can explicitly
construct such an expansion using linear algebra.


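Proposition 13.12 (Feature Map of Natural Kernels) Denote by $I$ the Fisher information
matrix, by $M$ the natural matrix, and by $s_i, \Lambda_i$ the eigensystem of $M^{-1/2} I M^{-1/2}$.
The kernel $k_M^{\mathrm{nat}}(x, x')$ can be decomposed into an eigensystem

$$\psi_i(x) = \frac{1}{\sqrt{\Lambda_i}}\, s_i^\top M^{-1/2} \nabla_\theta \ln p(x|\theta) \quad \text{and} \quad \lambda_i = \Lambda_i. \tag{13.37}$$

Note that if $M = I$, we have $\lambda_i = \Lambda_i = 1$.

Proof It can be seen immediately that (13.36) is satisfied. This follows from the
fact that the $s_i$ form an orthonormal basis, $\mathbf{1} = \sum_i s_i s_i^\top$, and the definition of $k_M^{\mathrm{nat}}$; the terms
depending on $\Lambda_i$ cancel each other. The $\psi_i$ are orthonormal, since

$$\langle \psi_i, \psi_j \rangle = \int \Big( \tfrac{1}{\sqrt{\Lambda_i}}\, s_i^\top M^{-1/2} \nabla_\theta \ln p(x|\theta) \Big) \Big( \tfrac{1}{\sqrt{\Lambda_j}}\, s_j^\top M^{-1/2} \nabla_\theta \ln p(x|\theta) \Big)\, p(x|\theta)\, dx = \frac{1}{\sqrt{\Lambda_i \Lambda_j}}\, s_i^\top M^{-1/2} I M^{-1/2} s_j = \delta_{ij}, \tag{13.38}$$

which completes the proof.⁵ ∎

The eigenvalues $\lambda$ of $k_I^{\mathrm{nat}}$ are all 1, reflecting the fact that the matrix $I$ whitens
the scores $\nabla_\theta \ln(p(x|\theta))$. It can also be seen from $\Upsilon_I^{\mathrm{nat}} = 1$ that (13.37) becomes
$\psi_i(x) = s_i^\top I^{-1/2} \nabla_\theta \ln(p(x|\theta))$, $1 \le i \le r$.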

What are the consequences of all eigenvalues being equal? Standard VC dimension
bounds (e.g., Theorem 5.5) state that the capacity of a linear class of functions
is bounded by $R^2 \Lambda^2$. Here, $R$ is the radius of the smallest sphere containing the
data (in feature space), and $\Lambda$ is the maximum allowed length of the weight vector.
Recently, it has been shown that both the spectrum of an associated integral operator
(Section 12.4.1) and the spectrum of the Gram matrix $(k(x_i, x_j))_{ij}$ [606, 477]
can be used to formulate tighter generalization error bounds, exploiting the fact
that for standard kernels, such as Gaussians, the distribution of the data in feature
space is rather non-isotropic, and the sphere bound is wasteful.

For the Fisher kernel, the non-isotropy does not occur, since the Fisher matrix
whitens the scores. This suggests that the standard isotropic VC bounds should be
fairly precise in this case. Moreover, the flat spectrum of the Fisher kernel suggests
a way of comparing different models: if we compute the Gram matrix for a set of
models $p(x|\theta_j)$, then we expect for the true model that $\lambda_i = 1$ for all $i$. In [388], it is
shown experimentally that this can be used for model selection.
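
The whitening effect can be checked numerically. The following is a minimal sketch, assuming a one-dimensional Gaussian model with parameters $\theta = (\mu, \sigma)$ (our choice of example, not taken from the text); the function names `gaussian_scores`, `fisher_information`, and `fisher_kernel` are hypothetical. It samples from the model, whitens the scores with the analytic Fisher information matrix, and estimates their second moment, which should be close to the identity.

```python
import numpy as np

def gaussian_scores(x, mu, sigma):
    """Scores, i.e. the gradient of ln p(x|theta), for a 1-D Gaussian with theta = (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=1)  # shape (m, 2)

def fisher_information(sigma):
    """Analytic Fisher information matrix of N(mu, sigma^2) with respect to (mu, sigma)."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def fisher_kernel(x, x_prime, mu, sigma):
    """Natural kernel with M = I: score(x)^T I^{-1} score(x')."""
    U = gaussian_scores(x, mu, sigma)
    U_prime = gaussian_scores(x_prime, mu, sigma)
    return U @ np.linalg.inv(fisher_information(sigma)) @ U_prime.T

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)  # sample from the (assumed true) model

U = gaussian_scores(x, mu, sigma)
I = fisher_information(sigma)
# Whitened scores I^{-1/2} U; I is diagonal here, so the inverse square root is elementwise.
whitened = U / np.sqrt(np.diag(I))
# Empirical second moment of the whitened scores; approximately the 2 x 2 identity.
second_moment = whitened.T @ whitened / len(x)
```

If the data were drawn from a different distribution than the model, the estimated second moment would generally deviate from the identity, which is the intuition behind using the spectrum for model comparison.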


5. This result may be extended to generic kernels of the form $k(x, x') = U(x)^\top M\, U(x')$ [597].
13.5 Summary


In this chapter, we collected a fair amount of material concerning the design of
kernels. We started with generic recipes for constructing kernels, and then moved
on to more specific methods which take into account features of a given problem
domain. As examples, we considered sequence processing, where we discussed
string kernels, and image recognition, where we introduced kernels that respect
local structure in images. Similar techniques also prove useful in DNA start codon
recognition. In both cases, these locality-improved kernels lead to substantial
improvements in performance. They are applicable in all cases where the relative
importance of subsets of products between features can be specified appropriately.
They do, however, slow down both training and testing by a constant factor, which
depends on the cost of evaluating the specific kernel used. Finally, we described
and analyzed the Fisher kernel method, which designs a kernel respecting an
underlying generative model. Further methods for constructing kernels had to
be omitted due to lack of space; examples are the fairly well developed theory
of kernels on groups (see Problem 13.11) and a recent modification of the Fisher
kernel [548].

As explained at the outset, the choice of kernel function is crucial in all kernel
algorithms. The kernel constitutes prior knowledge that is available about a task,
and its proper choice is thus crucial for success. Although the question of how to
choose the best kernel for a given dataset is often posed, it has no good answer.
Indeed, it is impossible to come up with the best kernel on the basis of the dataset
alone: the kernel reflects prior knowledge, and the latter is, by definition, knowledge
that is available in addition to the empirical observations.


13.6 Problems


13.1 (Powers of Kernels •) Using Proposition 13.2, prove that if $k$ is a kernel and $d \in \mathbb{N}$,
then $k^d$ is a kernel.


13.2 (Power Series of Dot Products •) Prove that the kernel defined in (13.3) is positive
definite if for all $n$, we have $a_n \ge 0$ (cf. Theorem 13.4).


13.3 (Iterating Kernels •) Let $z_1, \dots, z_n \in \mathcal{X}$, and $k(x, x')$ be a function on $\mathcal{X} \times \mathcal{X}$. Prove that

$$k^{(2)}(x, x') := \sum_{j=1}^n k(x, z_j)\, k(x', z_j) \tag{13.39}$$

is positive definite in the sense of Definition 2.5.


13.4 (Tensor Products ••) Prove Proposition 13.6. Hint: represent the tensor product
kernel as the product of two simpler kernels, and use Proposition 13.2.


13.5 (Direct Sums •) Prove Proposition 13.7.


13.6 (Diagonal Projection ••) If $k(x_1, x_2, x_1', x_2')$ is a kernel on $\mathcal{X}^2 \times \mathcal{X}^2$, then the
diagonal projection is defined as (e.g., [234])

$$k^{\Delta}(x, x') := k(x, x, x', x'). \tag{13.40}$$

Prove that if $k_1$ and $k_2$ are kernels on $\mathcal{X} \times \mathcal{X}$, we have $(k_1 \otimes k_2)^{\Delta} = k_1 k_2$ and
$(k_1 \oplus k_2)^{\Delta} = k_1 + k_2$. Consider as a special case the ANOVA kernel (13.13), and compute its
diagonal projections when $D = N$ and $D = 1$.


13.7 (Gaussian Kernels as R-Convolutions [234] •) Consider (13.12) for
$\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_D$. In this case, each composite object $x$ is simply a $D$-tuple consisting of the
components $x_1, \dots, x_D$. Show that $k_1 \star \dots \star k_D(x, x') = \prod_{d=1}^D k_d(x_d, x_d')$. Next, specialize
this result to the case where for $d = 1, \dots, D$, $\mathcal{X}_d = \mathbb{R}$, considering the one-dimensional
Gaussian kernel $k_d(x_d, x_d') = \exp(-(x_d - x_d')^2 / c_d)$ with $c_d > 0$. Show that the convolution
of $k_1, \dots, k_D$ is a multi-dimensional Gaussian kernel (cf. Chapter 2).


13.8 (ANOVA Kernels ••) Prove that ANOVA kernels (13.13) are positive definite,
either directly from the definition, or by showing that they are special cases of
R-convolutions.


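13.9 (Kernels of Sets [234] •) Let $\mathfrak{X}$ denote the set of all finite subsets of $\mathcal{X}$. Prove that if
$k$ is a kernel on $\mathcal{X} \times \mathcal{X}$, then

$$k(A, B) := \sum_{x \in A,\ x' \in B} k(x, x') \tag{13.41}$$

is a kernel on $\mathfrak{X} \times \mathfrak{X}$. Hint: consider the feature map $\Phi(A) := \sum_{x \in A} \Phi(x)$, where the $\Phi$ on
the right hand side is the feature map induced by $k$.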


13.10 (Weighted Kernels of Sets ••) Generalize the construction of the previous problem
to allow for

$$\Phi(A) = \sum_{x \in A} w(x)\, \Phi(x), \tag{13.42}$$

where $w$ is some nonnegative function on $\mathcal{X}$. Consider the case where $w$ takes values in
$\{0, 1\}$ only, and discuss the connection to the R-convolution kernel [234].


13.11 (Kernels on Groups ••) Let $G$ be a group, and $g, g' \in G$. Consider a kernel of the
form $k(g, g') = h((g')^{-1} g)$ where the function $h : G \to \mathbb{C}$ is chosen such that $k$ is positive
definite [219]. Such functions are called positive definite (cf. also Definition 2.29).

1. Prove that $h$ is Hermitian, that is, $h(g^{-1}) = \overline{h(g)}$.

2. Prove that $|h(g)| \le h(e)$, where $e$ is the neutral element of the group.

3. Prove that finite products of positive definite functions are again positive definite.

4. Consider the special case where the group is $(\mathbb{R}^N, +)$. Construct a positive definite
kernel on $\mathbb{R}^N$ via $k(g, g') = h((g')^{-1} g)$, where $h$ is some positive definite function (e.g.,
one of the cpd functions of order 0 in Table 2.1).


13.12 (The Kernel Map as a GNS Representation ••) Let $\mathcal{H}$ be a complex Hilbert
space. Using the notation of Problem 13.11, consider complex-valued functions of the form

$$h(g) = \langle U(g) v, v \rangle, \tag{13.43}$$

where $v \in \mathcal{H}$, and $U : G \to L(\mathcal{H})$ is a unitary representation of $G$. Verify that
$k(g, g') := h((g')^{-1} g)$ is a positive definite kernel on $G \times G$. Show that $\Phi : g \mapsto U(g) v$ is a valid
feature map for $k$, i.e., that $k(g, g') = \langle \Phi(g), \Phi(g') \rangle$.

Next, consider the converse. Show that given any positive definite function $h$ on $G$, we
can associate a complex Hilbert space $\mathcal{H}_h$, a unitary representation $U_h$ of $G$ in $\mathcal{H}_h$, and a
(cyclic) unit vector $v_h$, such that

$$h(g) = \langle U_h(g) v_h, v_h \rangle \tag{13.44}$$

holds true. Proceed as follows: as the Hilbert space, use the RKHS associated with $k$; as $v_h$,
use the function $k(\cdot, e) = h(e^{-1}\,\cdot) = h(\cdot)$ (recall that $e$ is $G$'s neutral element); finally, for
$f \in \mathcal{H}_h$, define the representation $U_h$ as

$$(U_h(g) f)(g') := f(g^{-1} g'). \tag{13.45}$$

Using these definitions, verify (13.44).
This representation is called the Gelfand-Naimark-Segal (GNS) construction.


13.13 (Conditional Symmetric Independence Kernels [585] ••) Consider the feature
map

$$\Phi(x) := \left[ p(x|c_i) \sqrt{p(c_i)} \right]_i, \tag{13.46}$$

where $p(x|c_i)$ is the probability that some discrete random variable $X$ takes the value $x$,
conditional on $C$ having taken the value $c_i$. Compute the kernel induced by this feature
map, and interpret it as a probability distribution. What kind of distributions can be
expressed in this way? A potentially rather useful class of such distributions arises from
pair hidden Markov models [585].


13.14 (String Kernel Recursion [585, 333] •••) Prove that the recursion (13.20) allows
the computation of (13.17).


13.15 (Local String Kernels ◦◦◦) Construct local string kernels by transferring the idea
described in Section 13.3 to the kernels described in Section 13.2. How does this relate to
the locality induced by the decay parameter $\lambda$?


13.16 (String Kernels Penalizing Excess Length ••) Note that the longer a matching
sequence is, the less it contributes to the comparison, if $\lambda < 1$. Design a string kernel that
does not suffer from this drawback, by choosing the $c_i$ in (13.18) such that the overall kernel
only penalizes 'excess' length.


13.17 (Two-Dimensional Sequence Kernels ◦◦◦) Can you generalize the ideas put
forward for sequences in Section 13.2 to two-dimensional structures, such as images? Can
you, for instance, construct a kernel that assesses the similarity of two images according to
the number and contiguity of common sub-images?


13.18 (Regularization for Composite Kernels ◦◦◦) Denote by $\Phi : \mathcal{X} \to \mathcal{H}$ a feature
map with known regularization operator $\Upsilon$, i.e., there exists some $\Upsilon$ and a space with dot
product $\langle \cdot, \cdot \rangle$ such that

$$\langle \Upsilon k(x, \cdot), \Upsilon k(x', \cdot) \rangle = k(x, x').$$

Furthermore denote by $\Phi' : \mathcal{H} \to \mathcal{H}'$ a second feature map, also with known regularization
operator $\Upsilon'$.
Can you construct the composite regularization operator $\tilde{\Upsilon}$ corresponding to

$$K(x, x') = \langle \Phi'(\Phi(x)), \Phi'(\Phi(x')) \rangle$$

from this information?
As a special case, let $\Phi$ be the score map of Section 13.4.1, and $\Phi'$ the map obtained by
applying a Gaussian kernel to $\Phi(x), \Phi(x')$.


13.19 (Tracy-Widom Law for the Fisher Kernel Matrix ◦◦◦) If the distribution in
feature space is spherical and Gaussian, the Tracy-Widom law describes the average and
standard deviation of the largest eigenvalue of the covariance matrix [544, 524, 270].
Noting that the distribution in the Fisher kernel feature space is spherical (but not
necessarily Gaussian), study how accurately the Tracy-Widom law holds in this case.


14 Kernel Feature Extraction


The idea of implicitly mapping the data into a high-dimensional feature space
has been very fruitful in the context of SV machines. Indeed, as described in 
Chapter 1, it is this feature which distinguishes them from the Generalized Portrait 
algorithm, which has been around since the sixties [573, 570], and which makes 
SVMs applicable to complex real-world problems that are not linearly separable. 
Thus, it is natural to ask whether the same idea might prove useful in other 
domains of learning. 

The present chapter describes a kernel-based method for performing a nonlinear 
form of Principal Component Analysis, called Kernel PCA. We show that through 
the use of positive definite kernels, we can efficiently compute principal compo- 
nents in high-dimensional feature spaces, which are related to input space by some 
nonlinear map. Furthermore, the chapter details how this method can be embed- 
ded in a general feature extraction framework, comprising classical algorithms 
such as projection pursuit, as well as sparse kernel feature analysis (KFA), a kernel 
algorithm for efficient feature extraction. 

After a short introduction to classical PCA in Section 14.1, we describe how 
to transfer the algorithm to a feature space setting (Section 14.2). In Section 14.3, 
we describe experiments using Kernel PCA for feature extraction. Following this, 
we introduce a general framework for feature extraction (Section 14.4), for which 
Kernel PCA is a special case. Another special case, sparse KFA, is discussed in 
more detail in Section 14.5, with particular focus on efficient implementation. 
Section 14.6 presents toy experiments for sparse KFA. 

Most of the present chapter only requires knowledge of linear algebra, including 
matrix diagonalization, and some basics in statistics, such as the concept of vari- 
ance. In addition, knowledge of positive definite kernels is required, as described 
in Chapter 1, and, in more detail, in Chapter 2. Section 14.5 requires slightly more 
background information, and builds on material explained in Chapters 4 and 6. 


14.1 Introduction 


Principal Component Analysis (PCA) is a powerful technique for extracting struc- 
ture from possibly high-dimensional data sets. It is readily performed by solving 
an eigenvalue problem, or by using iterative algorithms which estimate principal
components. For reviews of the existing literature, see [271, 140]; some of the
classical papers are [404, 248, 280]. PCA is an orthogonal transformation of the 
coordinate system in which we describe our data. The new coordinate system is 
obtained by projection onto the so-called principal axes of the data. The latter are 
called principal components or features. A small number of principal components is 
often sufficient to account for most of the structure in the data. These are some- 
times called factors or latent variables of the data. 

Let us begin by reviewing the standard PCA algorithm. In order to be able 
to generalize it to the nonlinear case, we formulate it in a way which uses dot 
products exclusively. 


Given a set of observations $x_i \in \mathbb{R}^N$, $i = 1, \dots, m$, which are centered, $\sum_{i=1}^m x_i = 0$,
PCA finds the principal axes by diagonalizing the covariance matrix,¹

$$C = \frac{1}{m} \sum_{j=1}^m x_j x_j^\top. \tag{14.1}$$

Note that $C$ is positive definite, and can thus be diagonalized with nonnegative
eigenvalues (Problem 14.1). To do this, we solve the eigenvalue equation,

$$\lambda v = C v, \tag{14.2}$$

for eigenvalues $\lambda \ge 0$ and nonzero eigenvectors $v \in \mathbb{R}^N \setminus \{0\}$. Substituting (14.1)
into this expression,

$$\lambda v = C v = \frac{1}{m} \sum_{j=1}^m \langle x_j, v \rangle\, x_j, \tag{14.3}$$

we see that all solutions $v$ with $\lambda \ne 0$ lie in the span of $x_1, \dots, x_m$, hence for these
solutions (14.2) is equivalent to

$$\lambda \langle x_i, v \rangle = \langle x_i, C v \rangle \quad \text{for all } i = 1, \dots, m. \tag{14.4}$$

1. More precisely, the covariance matrix is defined as the expectation of $x x^\top$; for
convenience, we use the same term to refer to the estimate (14.1) of the covariance matrix
from a finite sample.
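
As a small numerical illustration of this dot product formulation, the sketch below (our own example, with hypothetical variable names and randomly generated data) compares the eigenvalues of the covariance estimate (14.1) with those of the $m \times m$ dot product matrix, and recovers the leading principal axis as a linear combination of the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 8, 3
X = rng.normal(size=(m, N))
X = X - X.mean(axis=0)                 # center the observations

C = X.T @ X / m                        # covariance estimate (14.1), N x N
G = X @ X.T                            # dot product matrix <x_i, x_j>, m x m

lam_C, V = np.linalg.eigh(C)           # eigenvalues in ascending order
lam_G, A = np.linalg.eigh(G)

# The nonzero eigenvalues of G / m coincide with the eigenvalues of C.
top_C = lam_C[::-1]
top_G = lam_G[::-1][:N] / m

# Since v lies in the span of the x_i, write v = sum_i alpha_i x_i.
# Scaling the unit eigenvector of G by 1 / sqrt(eigenvalue) gives <v, v> = 1.
alpha = A[:, -1] / np.sqrt(lam_G[-1])
v_from_G = X.T @ alpha                 # leading axis from the dot product matrix
v_from_C = V[:, -1]                    # leading axis from the covariance matrix
```

The two axes agree up to sign, which is the usual ambiguity of eigenvectors.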


14.2 Kernel PCA


We now study PCA in the case where we are not interested in principal components
in input space, but rather principal components of variables, or features,
which are nonlinearly related to the input variables. These include variables ob- 
tained by taking arbitrary higher-order correlations between input variables, for 
instance. In the case of image analysis, this amounts to finding principal compo- 
nents in the space of products of pixels. 

As described in Chapter 2, the kernel trick enables us to construct different 
nonlinear versions of any algorithm which can be expressed solely in terms of 
dot products (thus, without explicit usage of the variables themselves). 


14.2.1 Nonlinear PCA as an Eigenvalue Problem 


Let us consider a feature space $\mathcal{H}$ (Chapter 2), related to the input domain (for
instance, $\mathbb{R}^N$) by a map

$$\Phi : \mathcal{X} \to \mathcal{H}, \quad x \mapsto \Phi(x), \tag{14.5}$$

which is possibly nonlinear. The feature space $\mathcal{H}$ could have an arbitrarily large,
and possibly infinite, dimension. Again, we assume that we are dealing with
centered data, $\sum_{i=1}^m \Phi(x_i) = 0$; we shall return to this point later. In $\mathcal{H}$, the
covariance matrix takes the form

$$C = \frac{1}{m} \sum_{j=1}^m \Phi(x_j) \Phi(x_j)^\top. \tag{14.6}$$

If $\mathcal{H}$ is infinite-dimensional, we think of $\Phi(x_j) \Phi(x_j)^\top$ as a linear operator on $\mathcal{H}$,
mapping $x \mapsto \Phi(x_j) \langle \Phi(x_j), x \rangle$.

We now have to find eigenvalues $\lambda \ge 0$ and nonzero eigenvectors $v \in \mathcal{H} \setminus \{0\}$
satisfying

$$\lambda v = C v. \tag{14.7}$$
Again, all solutions $v$ with $\lambda \ne 0$ lie in the span of $\Phi(x_1), \dots, \Phi(x_m)$. For us, this has
two useful consequences: first, we may instead consider the set of equations

$$\lambda \langle \Phi(x_n), v \rangle = \langle \Phi(x_n), C v \rangle \quad \text{for all } n = 1, \dots, m, \tag{14.8}$$

and second, there exist coefficients $\alpha_i$ ($i = 1, \dots, m$) such that

$$v = \sum_{i=1}^m \alpha_i \Phi(x_i). \tag{14.9}$$

Combining (14.8) and (14.9), we get

$$\lambda \sum_{i=1}^m \alpha_i \langle \Phi(x_n), \Phi(x_i) \rangle = \frac{1}{m} \sum_{i=1}^m \alpha_i \Big\langle \Phi(x_n), \sum_{j=1}^m \Phi(x_j) \langle \Phi(x_j), \Phi(x_i) \rangle \Big\rangle \tag{14.10}$$

for all $n = 1, \dots, m$. In terms of the $m \times m$ Gram matrix $K_{ij} := \langle \Phi(x_i), \Phi(x_j) \rangle$, this
reads

$$m \lambda K \alpha = K^2 \alpha, \tag{14.11}$$

where $\alpha$ denotes the column vector with entries $\alpha_1, \dots, \alpha_m$. To find solutions of
(14.11), we solve the dual eigenvalue problem,

$$m \lambda \alpha = K \alpha, \tag{14.12}$$

for nonzero eigenvalues. It can be shown that this yields all solutions of (14.11)
that are of interest for us (Problem 14.14).
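
The step from (14.10) to (14.11) can be spelled out componentwise. The following is our own expansion of that step, using only the definition of the Gram matrix $K$ given above:

```latex
% Left hand side of (14.10), for fixed n:
\lambda \sum_{i=1}^m \alpha_i \langle \Phi(x_n), \Phi(x_i) \rangle
  = \lambda \sum_{i=1}^m K_{ni}\, \alpha_i
  = \lambda\, (K\alpha)_n .

% Right hand side of (14.10), using linearity of the dot product:
\frac{1}{m} \sum_{i=1}^m \alpha_i \sum_{j=1}^m
  \langle \Phi(x_n), \Phi(x_j) \rangle \langle \Phi(x_j), \Phi(x_i) \rangle
  = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^m K_{nj} K_{ji}\, \alpha_i
  = \frac{1}{m}\, (K^2 \alpha)_n .

% Collecting the components n = 1, ..., m and multiplying by m:
m \lambda K \alpha = K^2 \alpha .
```

Multiplying (14.12) from the left by $K$ shows that every solution of the dual problem also solves (14.11).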

Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ denote the eigenvalues of $K$ (in other words, the solutions
$m\lambda$ of (14.12)), and $\alpha^1, \dots, \alpha^m$ the corresponding complete set of eigenvectors,
with $\lambda_p$ being the last nonzero eigenvalue (assuming that $\Phi$ is not identically 0).
We normalize $\alpha^1, \dots, \alpha^p$ by requiring that the corresponding vectors in $\mathcal{H}$ (see
(14.9)) be normalized,

$$\langle v^n, v^n \rangle = 1 \quad \text{for all } n = 1, \dots, p. \tag{14.13}$$

By virtue of (14.9) and (14.12), this translates to a normalization condition for
$\alpha^1, \dots, \alpha^p$,

$$1 = \sum_{i,j=1}^m \alpha_i^n \alpha_j^n \langle \Phi(x_i), \Phi(x_j) \rangle = \sum_{i,j=1}^m \alpha_i^n \alpha_j^n K_{ij} = \langle \alpha^n, K \alpha^n \rangle = \lambda_n \langle \alpha^n, \alpha^n \rangle. \tag{14.14}$$

For the purpose of principal component extraction, we need to compute projections
onto the eigenvectors $v^n$ in $\mathcal{H}$ ($n = 1, \dots, p$). Let $x$ be a test point, with an
image $\Phi(x)$ in $\mathcal{H}$. Then

$$\langle v^n, \Phi(x) \rangle = \sum_{i=1}^m \alpha_i^n \langle \Phi(x_i), \Phi(x) \rangle \tag{14.15}$$

are the nonlinear principal components (or features) corresponding to $\Phi$.²

Let us summarize the algorithm. To perform kernel-based PCA (Figure 14.1),
henceforth referred to as Kernel PCA, the following steps are carried out. First, we
compute the Gram matrix $K_{ij} = k(x_i, x_j)$. Next, we diagonalize $K$, and normalize
the eigenvector expansion coefficients $\alpha^n$ by requiring $\lambda_n \langle \alpha^n, \alpha^n \rangle = 1$. Finally, to
extract the principal components (corresponding to the kernel $k$) of a test point $x$,
we then compute projections onto the eigenvectors by

$$\langle v^n, \Phi(x) \rangle = \sum_{i=1}^m \alpha_i^n k(x_i, x), \quad n = 1, \dots, p; \tag{14.16}$$

see (14.15) and Figure 14.2.

For the sake of simplicity, we have so far made the assumption that the observations
are centered. This is easy to achieve in input space, but more difficult in $\mathcal{H}$,
as we cannot explicitly compute the mean of the mapped observations in $\mathcal{H}$. There
is a way to do it, however, and this leads to slightly modified equations for Kernel
PCA. It turns out (Problem 14.5) that we then need to diagonalize

$$\tilde{K}_{ij} = (K - 1_m K - K 1_m + 1_m K 1_m)_{ij}, \tag{14.17}$$

using the notation $(1_m)_{ij} := 1/m$ for all $i, j$.

Kernel PCA based on the centered matrix $\tilde{K}$ can also be performed with the
larger class of conditionally positive definite matrices. This is due to the fact
that when we center the data, we make the problem translation invariant (cf.
Proposition 2.26).
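
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`center_gram`, `kernel_pca_fit`, `kernel_pca_transform`, `linear_kernel`, `rbf_kernel`) are hypothetical, the requested number of components is assumed not to exceed the number of nonzero eigenvalues, and the centering of test-point kernel values is our own derivation from expanding $\langle \Phi(x) - \bar{\Phi}, \Phi(x_j) - \bar{\Phi} \rangle$, which the text leaves to Problem 14.5.

```python
import numpy as np

def linear_kernel(A, B):
    """k(x, x') = <x, x'> for all pairs of rows of A and B."""
    return A @ B.T

def rbf_kernel(A, B, c=1.0):
    """Gaussian kernel exp(-||x - x'||^2 / c); c is an illustrative width parameter."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / c)

def center_gram(K):
    """Centered Gram matrix (14.17), with (1_m)_ij = 1/m."""
    m = K.shape[0]
    one_m = np.full((m, m), 1.0 / m)
    return K - one_m @ K - K @ one_m + one_m @ K @ one_m

def kernel_pca_fit(X, kernel, n_components):
    K = kernel(X, X)
    eigvals, eigvecs = np.linalg.eigh(center_gram(K))   # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]    # largest eigenvalues first
    lam = eigvals[order]
    # Normalize so that lam_n <alpha^n, alpha^n> = 1 (assumes lam_n > 0).
    alpha = eigvecs[:, order] / np.sqrt(lam)
    return {"X": X, "K": K, "alpha": alpha, "lam": lam, "kernel": kernel}

def kernel_pca_transform(model, X_test):
    """Projections (14.16), computed with centered kernel values."""
    K, alpha = model["K"], model["alpha"]
    K_test = model["kernel"](X_test, model["X"])         # shape (t, m)
    K_test_c = (K_test
                - K_test.mean(axis=1, keepdims=True)     # mean over training points
                - K.mean(axis=0)                         # column means of training K
                + K.mean())                              # grand mean
    return K_test_c @ alpha

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
model = kernel_pca_fit(X_train, rbf_kernel, n_components=3)
features = kernel_pca_transform(model, rng.normal(size=(5, 2)))  # shape (5, 3)
```

With `linear_kernel`, the extracted features coincide with those of standard PCA up to the sign of each component, which is a convenient sanity check.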


14.2.2 Properties of Kernel PCA 


We know that Kernel PCA corresponds to standard PCA in some high-dimensional 
feature space. Consequently, all mathematical and statistical properties of PCA (cf. 
[271, 140]) carry over to Kernel PCA, with the modification that they become statements
about a set of points $\Phi(x_i)$, $i = 1, \dots, m$, in $\mathcal{H}$, rather than in $\mathbb{R}^N$.


Proposition 14.1 (Optimality Properties of Kernel PCA) Kernel PCA is the orthogonal
basis transformation in $\mathcal{H}$ with the following properties (assuming that the eigenvectors
are sorted in descending order of eigenvalue size):

■ The first $q$ ($q \in \{1, \dots, m\}$) principal components, or projections on eigenvectors, carry
more variance than any other $q$ orthogonal directions

■ The mean-squared approximation error in representing the observations in $\mathcal{H}$ by the first
$q$ principal components is minimal (over all possible $q$ directions)

■ The principal components are uncorrelated

■ The first $q$ principal components have maximal mutual information (see [140, 132]) with
respect to the inputs (this holds under Gaussianity assumptions in $\mathcal{H}$, and thus strongly
depends on the particular kernel chosen and on the data)


2. Note that in our derivation, we could have used the known result (e.g., [297]) that PCA
can be carried out on the dot product matrix $(\langle x_i, x_j \rangle)_{ij}$ instead of (14.1). For the sake of clarity
and ease of extendability (regarding centering the data in $\mathcal{H}$), however, we gave a detailed
derivation.


Figure 14.1 The basic idea of Kernel PCA. In some high-dimensional feature
space $\mathcal{H}$ (bottom right), we perform linear PCA, as with classical PCA in input
space (top). Since $\mathcal{H}$ is nonlinearly related to input space (via $\Phi$), the contour
lines of constant projections onto the principal eigenvector (drawn as an arrow)
are nonlinear in input space. We cannot draw a pre-image of the eigenvector
in input space, as it may not even exist. Crucial to Kernel PCA is the fact that
there is no need to perform the map into $\mathcal{H}$: all necessary computations are
carried out using a kernel function $k$ in input space (here: $\mathbb{R}^2$).
[Panel labels: linear PCA, $k(x, x') = \langle x, x' \rangle$; kernel PCA, e.g. $k(x, x') = \langle x, x' \rangle^d$.]


Figure 14.2 Feature extractor constructed using Kernel PCA (cf. (14.16)). In the first layer,
the input vector is compared to the sample via a kernel function, chosen a priori (e.g.
polynomial, Gaussian, or sigmoid). The outputs are then linearly combined using weights,
which are found by solving an eigenvector problem. As shown in the text, the function of
the network can be thought of as the projection onto an eigenvector of a covariance matrix
in a high-dimensional feature space. As a function on input space, it is nonlinear.
[Diagram labels: feature value $\langle v, \Phi(x) \rangle = \sum_i \alpha_i k(x_i, x)$; weights (eigenvector coefficients) $\alpha_1, \alpha_2, \alpha_3, \alpha_4$.]


Proof All these statements are completely analogous to the case of standard
PCA. As an example, we prove the second property, in the simple case where
the data $x_1, \dots, x_m$ in feature space are centered. We consider an orthogonal basis
transformation $W$, and use the notation $P_q$ for the projector on the first $q$ canonical
basis vectors $\{e_1, \dots, e_q\}$. Then the mean squared reconstruction error using $q$
vectors is

$$\frac{1}{m} \sum_i \| x_i - W^\top P_q W x_i \|^2 = \frac{1}{m} \sum_i \| W x_i - P_q W x_i \|^2$$

$$= \frac{1}{m} \sum_i \sum_{j > q} \langle W x_i, e_j \rangle^2$$

$$= \frac{1}{m} \sum_i \sum_{j > q} \langle x_i, W^\top e_j \rangle^2$$

$$= \frac{1}{m} \sum_i \sum_{j > q} \langle W^\top e_j, x_i \rangle \langle x_i, W^\top e_j \rangle$$

$$= \sum_{j > q} \langle W^\top e_j, C W^\top e_j \rangle. \tag{14.18}$$

It can easily be seen that the values of this quadratic form (which gives the variances
in the directions $W^\top e_j$) are minimal if the $W^\top e_j$ are chosen as its (orthogonal)
eigenvectors with smallest eigenvalues. ∎


To translate these properties of PCA in $\mathcal{H}$ into statements about the data in input
space, we must consider specific choices of kernel. One such input space characteristic
is the invariance property of kernels depending only on $\langle x, x' \rangle$ (cf. Section 2.3).

Can we guarantee that this algorithm works well, particularly in high-dimensional
spaces (cf. Problem 14.9)? It is possible to draw some simple analogies to
the standard SV reasoning. The feature extractors (14.16) are linear functions in
the feature space $\mathcal{H}$, with regularization properties characterized by the length of
their weight vector $v$, as in the SV case. When applied to the training data, the
$n$th feature extractor generates a set of outputs with variance $\lambda_n$. Dividing each
coefficient vector $\alpha^n$ by $\sqrt{\lambda_n}$, we obtain a set of nonlinear feature extractors with
unit variance output, and the following interesting property:


Proposition 14.2 (Connection KPCA - SVM [480]) For all $n \in \{1, \dots, p\}$, the $n$th
Kernel PCA feature extractor, scaled by $1/\sqrt{\lambda_n}$, is optimal among all feature extractors of
the form $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$ (cf. (14.16)), in the sense that it has minimal weight vector
norm in the RKHS $\mathcal{H}$,

$$\|v\|^2 = \sum_{i,j=1}^m \alpha_i \alpha_j k(x_i, x_j), \tag{14.19}$$

subject to the conditions that
(1) it is orthogonal to the first $n - 1$ Kernel PCA feature extractors (in feature space), and
(2) it leads to a unit variance set of outputs when applied to the training set $x_1, \dots, x_m$.


Therefore, Kernel PCA can be considered a method for extracting potentially
interesting functions that have low capacity. Here, “interestingness” is ensured by
the unit variance, and capacity is measured by the length of the weight vector.
As discussed in Section 16.3, this capacity measure is identical to that used in
Gaussian processes, hence it could be interpreted as a Bayesian prior on the
space of functions by setting $p(f) \propto \exp(-\frac{1}{2}\|\Upsilon f\|^2)$, where $\Upsilon$ is the regularization
operator corresponding to $k$ (see Theorem 4.9 for details). From this perspective,
the first extractor (cf. (14.16)) $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$ is given by

$$f = \operatorname*{argmax}_{\operatorname{Var}(f) = 1} \exp\left(-\tfrac{1}{2} \|\Upsilon f\|^2\right), \tag{14.20}$$

where $\operatorname{Var}(f)$ denotes the (estimate of the) variance of $f(x)$ for $x$ drawn from the
underlying distribution. We return to this topic in Section 14.4, where we use
Proposition 14.2 as the basis of a general feature extraction framework.

Unlike linear PCA, the method proposed allows the extraction of a number of 
principal components which can exceed the input dimensionality. Suppose that the 
number of observations m exceeds the input dimensionality N. Linear PCA, even 
when it is based on the m x m dot product matrix, can find at most N nonzero 
eigenvalues — the latter are identical to the nonzero eigenvalues of the N x N 
covariance matrix. By contrast, Kernel PCA can find up to m nonzero eigenvalues 
— a fact that illustrates that it is impossible to perform kernel PCA directly on an 
N x N covariance matrix. 
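
A quick numerical check of this point is sketched below, assuming randomly generated two-dimensional data and an illustrative Gaussian kernel width (both our own choices; the helper names are hypothetical). It counts the numerically nonzero eigenvalues of the centered Gram matrix for a linear and a Gaussian kernel. Note that with centering, at most $m - 1$ eigenvalues can be nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 10, 2                               # more observations than input dimensions
X = rng.normal(size=(m, N))

def centered(K):
    """Center a Gram matrix in feature space, equivalent to (14.17)."""
    H = np.eye(len(K)) - np.full((len(K), len(K)), 1.0 / len(K))
    return H @ K @ H

def count_nonzero_eigenvalues(K, tol=1e-10):
    """Number of eigenvalues above a relative tolerance (illustrative threshold)."""
    eigvals = np.linalg.eigvalsh(centered(K))
    return int((eigvals > tol * eigvals.max()).sum())

K_lin = X @ X.T                                            # linear kernel <x, x'>
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-sq_dists / 0.5)                            # Gaussian kernel, width chosen for illustration

n_lin = count_nonzero_eigenvalues(K_lin)   # bounded by the input dimension N
n_rbf = count_nonzero_eigenvalues(K_rbf)   # can exceed N, bounded by m
```

The linear kernel yields no more nonzero eigenvalues than input dimensions, whereas the Gaussian kernel yields more, so more features than input dimensions can be extracted.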

Being just a basis transformation, standard PCA allows the reconstruction of the 
original patterns $x_i$, $i = 1, \dots, m$, from a complete set of extracted principal
components $\langle x_i, v_j \rangle$, $j = 1, \dots, m$, by expansion in the eigenvector basis. Even using
an incomplete set of components, good reconstruction is often possible. In Kernel 
PCA, this is more difficult: we can reconstruct the image of a pattern in H from its 
nonlinear components; if we only have an approximate reconstruction, however, 
there is no guarantee that we can find an exact pre-image of the reconstruction in 
input space. In this case, we have to resort to approximations (cf. Chapter 18). 


14.2.3 Comparison to Other Methods 


Starting from some of the properties characterizing PCA, it is possible to develop 
a number of generalizations of linear PCA to the nonlinear case. Alternatively, we 
may choose an iterative algorithm which adaptively estimates principal compo- 
nents, and make some of its parts nonlinear to extract nonlinear features. Rather 
than giving a full review of this field here, we briefly describe four approaches, 
and refer the reader to [140] for more detail. 

Beginning with the pioneering work of Oja [387], a number of unsupervised 
neural-network type algorithms to compute principal components have been pro- 
posed (for instance, [457]). Compared with the standard approach of diagonalizing 
the covariance matrix, they have advantages in cases where the data are nonsta- 
tionary. Nonlinear variants of these algorithms are obtained using nonlinear neu- 
rons. The algorithms then extract features, referred to by the authors as nonlinear 
principal components. These approaches, however, do not have the geometrical 
interpretation of Kernel PCA as being standard PCA in a feature space nonlinearly 
related to input space, and it is thus more difficult to understand what exactly they 
are extracting. For a discussion of some approaches, see [279]. 

Next, consider a linear perceptron with one hidden layer, which is smaller 
than the input (that is, the dimension of the data). If we train it to reproduce
the input values as outputs (in other words, we use it in autoassociative mode),
then the hidden unit activations form a lower-dimensional representation of the
data, closely related to PCA (see for instance [140]). To generalize to a nonlinear
setting, we use nonlinear neurons and additional layers.³ While this can of course
be considered a form of nonlinear PCA, it should be stressed that the resulting
network training consists of solving a hard nonlinear optimization problem, with
the possibility of getting trapped in local minima. Additionally, neural network
implementations often pose a risk of overfitting. Another drawback of neural
approaches to nonlinear PCA is that the number of components to be extracted
has to be specified in advance. As an aside, note that hyperbolic tangent kernels
can be used to extract neural network type nonlinear features using Kernel PCA
(Figure 14.6). As in Figure 14.2, the principal components of a test point $x$ in this
case take the form $\sum_i \alpha_i^n \tanh(\kappa \langle x_i, x \rangle + \Theta)$.

An approach with a clear geometric interpretation in input space is the method 
of principal curves [231], which iteratively estimates a curve (or surface) that cap- 
tures the structure of the data. The data are projected onto a curve, determined by 
the algorithm, with the property that each point on the curve is the average of all 
data points projecting onto it. It can be shown that the only straight lines with the 
latter property are principal components, so principal curves are indeed a gener- 
alization of standard PCA. To compute principal curves, a nonlinear optimization 
problem must be solved. The dimensionality of the surface, and thus the number 
of features to extract, is specified in advance. Some authors [434] discuss parallels 
between the Principal Curve algorithm and self-organizing feature maps [302] for 
dimensionality reduction. For further information, and a kernel-based variant of 
the principal curves algorithm, cf. Chapter 17. 

Kernel PCA is a nonlinear generalization of PCA in the sense that (a) it performs 
PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality, and 
(b) if we use the kernel $k(x, x') = \langle x, x' \rangle$, we recover the original PCA algorithm.
Compared with the above approaches, the main advantage of Kernel PCA is that 
no nonlinear optimization is involved — it is essentially linear algebra, as with 
standard PCA. In addition, we need not specify the number of components that 
we want to extract in advance. Compared with principal curves, Kernel PCA is 
harder to interpret in input space; however, for polynomial kernels at least, it 
has a very clear interpretation in terms of higher-order features. Compared with 
neural approaches, Kernel PCA can be disadvantageous if we need to process a 
very large number of observations, as this results in a large matrix K. It is possible, 
however, to use sparse greedy methods to perform Kernel PCA approximately 
(Section 10.2). 
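
Properties (a) and (b) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `kernel_pca` and the synthetic data are illustrative choices, not taken from the text. Kernel PCA is reduced to an eigendecomposition of the centered kernel matrix, and with the linear kernel the extracted features agree with standard PCA projections up to sign.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA features of the training points, given an (m, m) kernel matrix K."""
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)         # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, A = eigvals[idx], eigvecs[:, idx]
    alphas = A / np.sqrt(lam)    # scale so the feature-space eigenvectors have unit norm
    return Kc @ alphas, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.3])

# Kernel PCA with the linear kernel k(x, x') = <x, x'>
feats, lam = kernel_pca(X @ X.T, 2)

# Standard PCA on the same data, for comparison
Xc = X - X.mean(axis=0)
cov_vals, cov_vecs = np.linalg.eigh(Xc.T @ Xc / len(X))
pca_feats = Xc @ cov_vecs[:, ::-1][:, :2]

print(np.allclose(np.abs(feats), np.abs(pca_feats)))
```

No nonlinear optimization appears anywhere in the sketch; swapping `X @ X.T` for any other kernel matrix changes only the input to the eigensolver.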


3. Simply using nonlinear activation functions in the hidden layer does not suffice: the 
linear activation functions already lead to the best approximation of the data (given the 
number of hidden nodes), so for the nonlinearities to have an effect on the components, the 
architecture needs to be changed (see [140]).

All these techniques provide nonlinear feature extractors defined on the whole
input space. In other words, they can be evaluated on patterns regardless of 
whether these are elements of the training set or not. Some other methods, such as 
the LLE algorithm [445] and multidimensional scaling (MDS) [116], are restricted to 
the training data. They aim to only provide a lower-dimensional representation of 
the training data, which is useful for instance for data visualization. 

Williams [598] recently pointed out that when considering the special case 
where we only extract features from the training data, Kernel PCA is actually 
closely connected to MDS. In a nutshell, MDS is a method for embedding data
into $\mathbb{R}^q$, based on pairwise dissimilarities. Consider a situation where the
dissimilarities are actually Euclidean distances in $\mathbb{R}^N$ ($N > q$). In the simplest variant of
MDS (“classical scaling”), we attempt to embed the training data into $\mathbb{R}^q$ such that
the squared distances $\Delta_{ij}^2 := \|x_i - x_j\|^2$ between all pairs of points are (on average)
preserved as well as possible. It can be shown from Proposition 14.1 that this is 
readily achieved by projecting onto the first q principal components. 
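
As an illustration of this equivalence, here is a small sketch (NumPy assumed; `classical_mds` and the random data are hypothetical names and inputs chosen for the example). It recovers an embedding from the matrix of squared distances alone and compares it with the projection onto the first $q$ principal components.

```python
import numpy as np

def classical_mds(D2, q):
    """Classical scaling: embed m points in R^q from their squared distances D2."""
    m = D2.shape[0]
    J = np.eye(m) - np.full((m, m), 1.0 / m)
    B = -0.5 * J @ D2 @ J                 # centered Gram matrix recovered from distances
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:q]
    return vecs[:, top] * np.sqrt(vals[top])

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5)) * np.array([4.0, 2.0, 1.0, 0.5, 0.1])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Y = classical_mds(D2, q=2)

# Projection onto the first two principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:2].T

print(np.allclose(np.abs(Y), np.abs(P), atol=1e-6))
```

The two embeddings coincide up to the sign of each coordinate, which is arbitrary in both methods.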

In metric MDS, the dissimilarities $\Delta_{ij}$ are transformed by a (nonlinear) function
$\phi$ before the embedding is computed. In this case, the computation of the embedding
involves the minimization of a nonlinear “stress” function, which consists of
the sum over all mismatches. Usually, this stress function is minimized using
nonlinear optimization methods. This can be avoided for a large class of nonlinearities
$\phi$, however. Williams [598] showed that the metric MDS solution is a by-product
of performing kernel PCA with RBF kernels, $k(x_i, x_j) = \phi(\|x_i - x_j\|) = \phi(\Delta_{ij})$.⁴ In
this case, we thus get away with solving an eigenvalue problem.

The second of the aforementioned dimensionality reduction algorithms, LLE, 
can also be related to kernel PCA. One can show that one obtains the solution 
of LLE by performing kernel PCA on the Gram matrix computed from what we 
might call the locally linear embedding kernel. This kernel assesses similarity of two 
patterns based on the similarity of the coefficients required to represent the two 
patterns in terms of neighboring patterns. For details, see Problem 14.16. 

We conclude this section by noting that it has recently been pointed out that one 
can also connect kernel PCA to orthogonal series density estimation [200]. The kernel 
PCA eigenvalue decomposition provides the coefficients for a truncated density 
estimator expansion taking the form
$p(x) = \sum_{n=1}^{q} \lambda_n \left( \frac{1}{m} \sum_{i=1}^{m} \alpha_i^n \right) \langle v^n, \Phi(x) \rangle$, where
$q$ is the number of components taken into account, and $\alpha_i^n$ and $v^n$ are defined (and


4. One way of performing metric MDS is to first apply $\phi$, and then run classical MDS
on the resulting dissimilarity matrix. An interesting class of nonlinearities is the power
transformation $\phi(\Delta_{ij}) = \Delta_{ij}^{\mu}$, where $\mu > 0$ ([127], cited after [598]). Provided the original
dissimilarities $\Delta_{ij}$ arise from Euclidean distances, the power transformation generally leads
to a conditionally positive definite matrix $(-\Delta_{ij}^{2\mu})_{ij}$ if and only if $\mu \le 1$ (cf. (2.81)). The
centered version of this matrix, which is used in MDS, is thus positive definite if and only if
$\mu \le 1$ (cf. Proposition 2.26). Therefore, it is exactly in these cases that we can run classical
MDS after applying $\phi$ without running into problems. This answers a problem posed by
[127], for the case of Euclidean dissimilarities.

Figure 14.3 Two-dimensional toy example, with data generated as follows: x-values have
uniform distribution in [-1, 1], y-values are generated from $y_i = x_i^2 + \nu$, where $\nu$ is normal
noise with standard deviation 0.2. From left to right, the polynomial degree in the kernel 
(14.21) increases from 1 to 4; from top to bottom, the first 3 eigenvectors are shown, in order 
of decreasing eigenvalue size (eigenvalues are normalized to sum to 1). The figures contain 
lines of constant principal component value (contour lines); in the linear case (d = 1), these 
are orthogonal to the eigenvectors. We did not draw the eigenvectors, as in the general case, 
they belong to a higher-dimensional feature space. Note, finally, that for d = 1, there are only 
2 nonzero eigenvectors, this number being equal to the dimension of the input space. 


normalized) as in Section 14.2.1. This work builds on the connection between the 
eigenfunctions of the integral operator $T_k$ associated with the kernel $k$ and the
eigenvectors of the Gram matrix (see Problem 2.26).

Toy Examples


In this section, we present a set of experiments in which Kernel PCA is used (in the 
form taking into account centering in H) to extract principal components. First, 
we take a look at a simple toy example; following this, we describe real-world 
experiments where we assess the utility of the extracted principal components in 
classification tasks. 

To provide insight into how PCA in $\mathcal{H}$ behaves in input space, we describe a set
of experiments with an artificial 2-D data set, using polynomial kernels,

$k(x, x') = \langle x, x' \rangle^d$, (14.21)

Figure 14.4 Two-dimensional toy example with three data clusters (Gaussians with standard
deviation 0.1; depicted region, [-1, 1] x [-0.5, 1]): first 16 nonlinear principal components
extracted with $k(x, x') = \exp(-\|x - x'\|^2 / 0.1)$. Note that the first 2 principal components
(top left), which possess the largest eigenvalues, nicely separate the three clusters. The
components 3-5 split the clusters into halves. Similarly, components 6-8 split them again, 
in a manner orthogonal to the above splits. The higher components are more difficult to 
describe. They look for finer structure in the data set, identifying higher-order moments. 


of degree d = 1,...,4 (see Figure 14.3). Linear PCA (on the left) leads to just 2 
nonzero eigenvalues, as the input dimensionality is 2. By contrast, nonlinear PCA 
allows the extraction of further components. In the figure, note that nonlinear 
PCA produces contour lines of constant feature value, which reflect the structure 
in the data better than in linear PCA. In all cases, the first principal component 
varies monotonically along the parabola that underlies the data. In the nonlinear 
cases, the second and the third components also show behavior which is similar 
across different polynomial degrees. The third component, which comes with 
small eigenvalues (rescaled to sum to 1), seems to pick up the variance caused by 
the noise, as can be seen in the case of degree 2. Dropping this component would 
thus amount to noise reduction. 
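
The eigenvalue counts described above can be reproduced with a few lines. This is a sketch under stated assumptions: NumPy is available, the sample size of 100 and the random seed are arbitrary, and `normalized_spectrum` is an illustrative helper rather than anything defined in the text. The data follow the recipe in the caption of Figure 14.3.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-1.0, 1.0, size=m)
y = x**2 + rng.normal(scale=0.2, size=m)   # parabola plus noise, as in Figure 14.3
X = np.column_stack([x, y])

def normalized_spectrum(X, d):
    """Eigenvalues of the centered polynomial kernel matrix, scaled to sum to 1."""
    K = (X @ X.T) ** d
    m = len(K)
    J = np.eye(m) - np.full((m, m), 1.0 / m)
    vals = np.linalg.eigvalsh(J @ K @ J)[::-1]
    vals = np.clip(vals, 0.0, None)          # remove tiny negative round-off
    return vals / vals.sum()

counts = {}
for d in (1, 2, 3, 4):
    spec = normalized_spectrum(X, d)
    counts[d] = int((spec > 1e-10).sum())    # numerically nonzero eigenvalues
    print(d, counts[d], np.round(spec[:3], 3))
```

For $d = 1$ only two eigenvalues are numerically nonzero, matching the input dimensionality; for larger $d$ further components become available.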

Further toy examples, using radial basis function kernels (2.68) and neural 
network type sigmoid kernels (2.69), are shown in Figures 14.4-14.6. 

In Figure 14.7, we illustrate the fact that Kernel PCA can also be carried out

Figure 14.5 A plot of the data representation given by the first two principal components
of Figure 14.4. The clusters of Figure 14.4 end up roughly on separate lines (the left, right,
and top regions correspond to the clusters left, top, and right, respectively). Note that the
first component (the horizontal axis) already separates the clusters; this cannot be done
using linear PCA.

Figure 14.6 A smooth transition from linear PCA to nonlinear PCA is obtained using
hyperbolic tangent kernels $k(x, x') = \tanh(\kappa \langle x, x' \rangle + 1)$ with varying gain $\kappa$: from top to
bottom, $\kappa = 0.1, 1, 5, 10$ (data as in the previous figures). For $\kappa = 0.1$, the first two features
look like linear PCA features. For large $\kappa$, the nonlinear region of the tanh function comes
into play. In this case, kernel PCA can exploit the nonlinearity to allocate the highest feature
gradients to regions where there are data points, as can be seen in the case $\kappa = 10$.


USPS Character 
Recognition 


Figure 14.7 Kernel PCA on a toy dataset using the cpd kernel (2.81); contour plots of the
feature extractors corresponding to projections onto the first two principal axes in feature
space. From left to right: $\beta = 2, 1.5, 1, 0.5$. Notice how smaller values of $\beta$ make the feature
extractors increasingly nonlinear, which allows the identification of the cluster structure
(from [468]).


using conditionally positive definite kernels. We use the kernel $k(x, x') = -\|x - x'\|^{\beta}$
(2.81). As detailed in Chapter 2, algorithms that are translation-invariant in
feature space can utilize cpd kernels. Kernel PCA is such an algorithm, since any
translation in feature space is removed by the centering operation. Note that the
case $\beta = 2$ is actually equivalent to linear PCA. As we decrease $\beta$, we obtain
increasingly nonlinear feature extractors. As the kernel parameter $\beta$ gets smaller,
we also get more localized feature extractors (in the sense that the regions where
they have large gradients, corresponding to dense sets of contour lines in the plot,
get more localized). This could be interpreted as saying that smaller values of $\beta$
put less weight on large distances, thus yielding more robust distance measures.
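
The role of centering can be made concrete with a short check. The sketch below assumes NumPy; the helper names `centered` and `cpd_kernel` and the random data are illustrative. It uses the identity $-\|x - x'\|^2 = 2\langle x, x' \rangle - \|x\|^2 - \|x'\|^2$: the last two terms vanish under centering, so the centered kernel matrix for $\beta = 2$ is twice the centered linear Gram matrix. It also verifies that the centered matrices have no significantly negative eigenvalues for several $\beta \le 2$.

```python
import numpy as np

def centered(K):
    """Center a kernel matrix in feature space."""
    m = len(K)
    J = np.eye(m) - np.full((m, m), 1.0 / m)
    return J @ K @ J

def cpd_kernel(X, beta):
    """Conditionally positive definite kernel k(x, x') = -||x - x'||^beta."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return -dist ** beta

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

K_lin = centered(X @ X.T)
K_cpd = centered(cpd_kernel(X, 2.0))
print(np.allclose(K_cpd, 2.0 * K_lin))   # beta = 2 reproduces linear PCA up to scale

# After centering, the matrices are positive semidefinite up to round-off
min_eigs = {b: np.linalg.eigvalsh(centered(cpd_kernel(X, b))).min()
            for b in (0.5, 1.0, 1.5, 2.0)}
print(min_eigs)
```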

These toy experiments serve illustrative purposes, but they are no substitute for 
experiments on real-world data. Thus, we next report a study on a handwritten 
character recognition problem, the US postal service database (Section A.1). This 
database contains 9298 examples of dimensionality 256, of which 2007 make up the 
test set. For computational reasons, we used a subset of 3000 training examples to 
compute the matrix K. We then used polynomial Kernel PCA to extract nonlinear 
principal components from the training and test set. To assess the utility of the 
components (or features), we trained a soft margin hyperplane classifier on the 
classification task. 

Table 14.1 illustrates two advantages of using nonlinear kernels: first, perfor- 
mance of a linear classifier trained on nonlinear principal components is better 
than for the same number of linear components; second, the performance for non- 
linear components can be further improved by using more components than pos- 
sible in the linear case. The latter is related to the fact that there are many more 
higher-order features than there are pixels in an image. Regarding the first point,
note that extracting a certain number of features in a $10^{10}$-dimensional space
constitutes a much greater reduction of dimensionality than extracting the same
number of features in 256-dimensional input space.

Table 14.1 Test error rates on the USPS handwritten digit database, for linear Support
Vector Machines trained on nonlinear principal components extracted by PCA with kernel
(14.21), for degrees 1 through 7. The case of degree 1 corresponds to standard PCA, with
the number of nonzero eigenvalues being at most the dimensionality of the space (256).
Clearly, nonlinear principal components afford test error rates which are lower than in the
linear case (degree 1).
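
The order of magnitude of the feature space dimension can be reproduced by counting monomials. The sketch below rests on an assumption that should be stated plainly: the feature space of the homogeneous polynomial kernel (14.21) is taken to be spanned by the distinct monomials of exact degree $d$ in the 256 pixel values, and the helper `n_monomials` is a hypothetical name. Under this assumption, a degree of 5 gives a count on the order of $10^{10}$.

```python
from math import comb

def n_monomials(n_inputs, degree):
    """Number of distinct monomials of exact degree `degree` in `n_inputs` variables."""
    return comb(n_inputs + degree - 1, degree)

for d in range(1, 8):
    print(d, n_monomials(256, d))
```

Extracting, say, a few hundred or a few thousand features from a space of that size is a far stronger compression than extracting the same number from the 256 input pixels.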

For all numbers of features, the optimal kernel degree to use is around 4, which 
is consistent with Support Vector Machine results on the same data set. Addition- 
ally, with only one exception, the nonlinear features are superior to their linear 
counterparts. The resulting error rate for the best of our classifiers (4.0%) is much 
better than that obtained using linear classifiers operating directly on the image 
data (a linear Support Vector Machine achieves 8.9%; [470]); performance is iden- 
tical to that of nonlinear Support Vector classifiers [470]. This makes sense — recall 
from Section 2.2.6 that using all principal components is equivalent to running a 
nonlinear SVM with the same kernel. After all, if we consider all eigenvectors, Ker- 
nel PCA is just an orthogonal basis transformation, leaving the dot product invari- 
ant (cf. Section B.2). For a comprehensive list of results obtained on the USPS set, 
cf. Chapter 7. Note that the present results were obtained without using any prior 
knowledge about invariances of the problem at hand, which is why the perfor- 
mance is inferior to Virtual Support Vector classifiers (3.2%, Chapter 11). Adding 
local translation invariance, be it by generating “virtual” translated examples or 
by choosing a suitable kernel incorporating locality (such as those in Section 13.3, 
which led to an error rate of 3.0%), could further improve the results. 

Similarly good results have been obtained for other visual processing tasks, 
such as object recognition [467] and texture classification [294]. Kernel PCA has 
also been successfully applied to other problems, such as processing of biological 
event-related potentials [440], nonlinear regression [441], and document retrieval 
[126], as well as face detection and pose estimation [328, 436]. Yet another application,
in image denoising, is described later (Chapter 18). One of the more surprising
applications, with impressive success, is model selection [101].

Whilst it is encouraging that Kernel PCA leads to very good results, there are
nevertheless some issues that need to be addressed. First, the computational complexity
of the standard version of Kernel PCA, as described above, scales as $O(m^3)$ in
the sample size $m$. Second, the resulting feature extractors are given as dense
expansions in terms of the training patterns. Thus, for each test point, the feature
extraction requires $m$ kernel evaluations. Both issues can be dealt with using
approximations. As mentioned above, Kernel PCA can be approximated in a sparse
greedy way (Section 10.2); moreover, the feature extractors can be approximated in 
a sparse way using reduced set techniques, such as those described in Chapter 18. 
Alternatively, Tipping recently suggested a way of sparsifying Kernel PCA [540]. 

There is a second way of approaching the problem, however. Rather than stick- 
ing to the original algorithm, and trying to approximate its solution, we instead 
modify the algorithm to make it more efficient, and so that it automatically pro- 
duces sparse feature extractors. In order to design the modified algorithm, called 
sparse kernel feature analysis (KFA), it is useful to first describe a general feature 
extraction framework, which will contain Kernel PCA and sparse KFA as special 
cases. 

To this end, denote by $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ our set of patterns drawn
independently and identically distributed from an underlying probability distribution
$P(x)$. Our goal is to compute feature extractors that satisfy certain criteria of
simplicity (such as small RKHS norm [578, 512] or $\ell_1$ norm [343, 104, 37, 502, 72, 347])
and optimality (maximum variance [248, 280], for instance).


14.4.1 Principal Component Analysis 
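Let us start with PCA, assuming that $X \subset \mathbb{R}^N$. The first principal component of
a sample is given by the direction of projection with maximum variance. For
centered data,

$\tilde{X} := \{\tilde{x}_1, \ldots, \tilde{x}_m\}$ with $\tilde{x}_i := x_i - \frac{1}{m} \sum_{j=1}^{m} x_j$, (14.22)

the first eigenvector $v^1$ can be obtained as

$v^1 = \operatorname{argmax}_{\|v\|_2 \le 1} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \tilde{x}_i \rangle|^2$. (14.23)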


The successive eigenvectors $v^2, \ldots, v^N$ are chosen to be orthogonal to those
preceding, where each eigenvector $v^i$ satisfies a property similar to (14.23) with respect
to the remaining $(N - i + 1)$-dimensional subspace.

The solution of this optimization problem is normally obtained by computing
the leading eigenvector of the covariance matrix of $X$ (14.1). We shall
show below that there exist situations where finding the solution of problems like
(14.23) can be much easier than that. This simplification is achieved by replacing
the constraint on $v$ by one that lends itself to faster evaluation and optimization.
We shall return to this point below, in a more general setting.
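
To see the variance criterion (14.23) at work, the following sketch (NumPy assumed; the data, seed, and the helper `objective` are illustrative) compares the leading eigenvector of the covariance matrix with a large number of random unit vectors. None of the random directions attains a larger value of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mix
Xc = X - X.mean(axis=0)                      # centered data, as in (14.22)

def objective(v):
    """Empirical variance of the projections <v, x_i>, the criterion in (14.23)."""
    return np.mean((Xc @ v) ** 2)

C = Xc.T @ Xc / len(Xc)                      # covariance matrix
vals, vecs = np.linalg.eigh(C)
v1 = vecs[:, -1]                             # eigenvector with the largest eigenvalue

candidates = rng.normal(size=(1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(objective(v) for v in candidates)

print(objective(v1), vals[-1], best_random)
```

The value attained by `v1` equals the largest eigenvalue of the covariance matrix, which is the usual way the problem is solved in practice.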


14.4.2 Kernel PCA 


We denote by $\tilde{\Phi}(x_i) = \Phi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j)$ the centered version of the data in
feature space, and define

$F_{\mathcal{H}} := \{ w \mid w \in \mathcal{H}, \|w\|^2 \le 1 \}$ (14.24)

to be the set of candidate vectors to project on. The problem of finding the first
Kernel PCA eigenvector can then be stated as

$v^1 = \operatorname{argmax}_{v \in F_{\mathcal{H}}} \frac{1}{m} \sum_{i=1}^{m} \langle v, \tilde{\Phi}(x_i) \rangle^2 = \operatorname{argmax}_{v \in F_{\mathcal{H}}} \operatorname{Var}\{\langle v, \Phi(x_i) \rangle\}$. (14.25)

This modification, building on Proposition 14.2, may seem innocuous. Nevertheless,
it allows us to modify the feature extraction problem in interesting ways,
by replacing $F_{\mathcal{H}}$ with other sets that are more suitable for optimization. Before
moving on, let us recall the function space interpretation of Kernel PCA, already 
mentioned in Section 14.2.2. 

As explained in Section 2.2, we may think of the feature space in different 
representations. One such representation, discussed in Chapter 4, uses functions 
f expanded in terms of kernels. Therefore, rather than using the abstract feature 
space element $w$, the regularizer can be thought of in terms of functions $f$, and
$\Omega[f] = \|f\|_{\mathcal{H}}^2$. In this case, the constraint (14.24) becomes a constraint in this
function space. This means that we are looking for the function $f$ with the largest
empirical variance under the constraint $\Omega[f] \le 1$; in other words, we would like $f$ not
to be overly complex. Depending on the specific RKHS (and thus, depending
on the kernel), this can mean a function with small first derivative, or a small sum
of derivatives, or a particular frequency spectrum (cf. (4.28)). The criterion of large
variance under constraints can thus be interpreted as the requirement to seek a
simple yet “interesting” function of the current observations. In the next section,
we replace $\Omega[f]$ by another regularization functional, which turns out to be better
suited to optimization.
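
For reference, the variance criterion and the norm constraint can be written in terms of expansion coefficients. The following worked step is a sketch in notation introduced here for illustration: $\tilde{K}$ denotes the centered kernel matrix and $\alpha$ the coefficient vector of a candidate $v$ lying in the span of the centered images (components orthogonal to that span do not change any projection, so nothing is lost by this restriction).

```latex
\begin{align*}
v &= \sum_{i=1}^{m} \alpha_i \tilde{\Phi}(x_i),
  \qquad \tilde{K}_{ij} := \langle \tilde{\Phi}(x_i), \tilde{\Phi}(x_j) \rangle, \\
\|v\|^2 &= \alpha^\top \tilde{K} \alpha, \\
\frac{1}{m} \sum_{i=1}^{m} \langle v, \tilde{\Phi}(x_i) \rangle^2
  &= \frac{1}{m} \|\tilde{K} \alpha\|^2
   = \frac{1}{m} \alpha^\top \tilde{K}^2 \alpha .
\end{align*}
% If \tilde{K}\alpha = \lambda\alpha with \lambda \|\alpha\|^2 = 1 (so that \|v\| = 1),
% the objective equals \lambda / m; it is maximized by the largest eigenvalue.
```

Both quantities are quadratic forms in $\alpha$, which is why the $\ell_2$-type constraint leads to an eigenvalue problem, and why changing the constraint set changes the character of the optimization.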


14.4.3 Sparse Kernel Feature Analysis 


Figure 14.8 Left: the circle denotes the set $F_{\mathcal{H}}$ of possible $v$ to be used to find the projection
of maximum variance. Right: the absolute convex hull $F_{LP}$ determines the set of possible
directions of projection under the constraint $\sum_i |\alpha_i| \le 1$. In both cases, the small circles
represent observations.

Sparse solutions are often achieved in supervised learning settings by using an $\ell_1$
penalty on the expansion coefficients (see Section 4.9.2 and [343, 104, 184, 459, 37,
502, 71, 347]). We now use the same approach in feature extraction, deriving an
algorithm which requires only $n$ basis functions to compute the first $n$ features.
This algorithm is computationally simple, and scales approximately one order of
magnitude better on large datasets than standard Kernel PCA. First, we choose


$\Omega[f] = \sum_{i=1}^{m} |\alpha_i|$, (14.26)

as suggested in Section 4.9.2. This yields

$F_{LP} := \left\{ w \,\middle|\, w = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \text{ with } \sum_{i=1}^{m} |\alpha_i| \le 1 \right\}$, (14.27)

where the ‘LP’ in the name derives from the use of similar constructions in linear
programming (see for instance Section 7.7). Figure 14.8 gives a depiction of $F_{LP}$.
This setting leads to the first “principal vector” in the $\ell_1$ context,

$v^1 = \operatorname{argmax}_{v \in F_{LP}} \frac{1}{m} \sum_{i=1}^{m} \left\langle v, \Phi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \right\rangle^2$. (14.28)

Again, subsequent “principal vectors” can be defined by enforcing optimality
with respect to the remaining orthogonal subspaces. Due to the $\ell_1$ constraint,
the solution of (14.28) has the favorable property of being sparse in terms of the
coefficients $\alpha_i$ (the coefficients are chosen from the “hyper-diamond-shaped” $\ell_1$
ball).⁵ In fact, as we shall show in Section 14.5, the optimal solution is found by
picking the direction $\Phi(x_i)$ corresponding to a single pattern, meaning that the
solution lies on one of the vertices of the $\ell_1$ ball. We shall return to this point below.


5. Note that the requirement $\|v\|^2 = 1$, or the corresponding $\ell_1$ constraint, is necessary:
the value of the target function could otherwise increase without bound, simply by
rescaling $v$.

For the sake of completeness, note that rather than using $\ell_1$ constraints, we could
instead seek optimal solutions with respect to $\ell_p$ balls,

$F_p := \left\{ w \,\middle|\, w = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \text{ with } \sum_{i=1}^{m} |\alpha_i|^p \le 1 \right\}$. (14.29)

In a nutshell, we have obtained sparse KFA by sticking with the variance criterion,
and modifying the constraint. By contrast, in the next section, we revisit the
standard $\ell_2$ ball constraint that we know from (kernel) PCA, and instead change
the objective function to be maximized subject to the constraint.


14.4.4 Projection Pursuit 


Projection Pursuit [182, 252, 272, 176, 181, 139] differs from Principal Component
Analysis (Section 14.4.1) in that it replaces the criterion of maximum variance by
different criteria, such as non-Gaussianity. The first principal direction in
Projection Pursuit is given by

$v^1 = \operatorname{argmax}_{\|v\|_2 \le 1} \frac{1}{m} \sum_{i=1}^{m} q(\langle v, \tilde{x}_i \rangle)$, (14.30)

where $q : \mathbb{R} \to \mathbb{R}$ is a function such that $\sum_{i=1}^{m} q(\langle v, \tilde{x}_i \rangle)$ is large whenever the
distribution of $\langle v, \tilde{x}_i \rangle$ is non-Gaussian. More generally, if some coupling between
the different projections occurs, we can write Projection Pursuit as

$v^1 = \operatorname{argmax}_{\|v\|_2 \le 1} Q(\langle v, \tilde{x}_1 \rangle, \ldots, \langle v, \tilde{x}_m \rangle)$. (14.31)

A possible function $q$ is for instance $q(\xi) = \xi^4$. Apart from non-Gaussianity,
contrast functions are sometimes designed to capture other properties, such as
whether the distribution of features has multiple modes, the Fisher Information
(so as to maximize it), the negative Shannon entropy, or other quantities of interest.
For a detailed account of these issues see [182, 181, 252, 272, 176, 227].

To evaluate these contrast functions, it is often necessary to first compute a density
estimate for the distribution of $\langle v, x_i \rangle$, for instance using the Parzen windows
method. A final issue to note is that the determination of interesting directions is 
often quite computationally expensive [109, 301], since (14.31) may exhibit many 
local minima unless Q is convex. Practical projection pursuit tools (such as XGobi 
[533]) use gradient descent for optimization purposes, sometimes with additional 
(interactive) user input. 
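
A toy version of (14.30) with $q(\xi) = \xi^4$ can be written without any optimization library. The sketch below makes several assumptions that are not in the text: the data are two-dimensional with one heavy-tailed (Laplace) and one Gaussian coordinate of equal variance, the search is restricted to unit vectors (a convex $q$ attains its maximum on the boundary of the ball), and a plain grid over angles replaces the gradient-based search mentioned above.

```python
import math
import random

rng = random.Random(0)
m = 10000
# Column 0: Laplace (heavy-tailed); column 1: Gaussian. Both have unit variance.
raw = [((rng.expovariate(1.0) - rng.expovariate(1.0)) / math.sqrt(2.0),
        rng.gauss(0.0, 1.0)) for _ in range(m)]
mean_a = sum(a for a, _ in raw) / m
mean_b = sum(b for _, b in raw) / m
data = [(a - mean_a, b - mean_b) for a, b in raw]     # centered sample

def contrast(theta, q=lambda xi: xi ** 4):
    """(1/m) sum_i q(<v, x_i>) for the unit vector v = (cos theta, sin theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return sum(q(c * a + s * b) for a, b in data) / len(data)

thetas = [math.pi * k / 180 for k in range(180)]      # one direction per degree
best_theta = max(thetas, key=contrast)
v1 = (math.cos(best_theta), math.sin(best_theta))
print(v1, contrast(best_theta), contrast(math.pi / 2))
```

Since both coordinates have the same variance, PCA has no preferred direction here, whereas the fourth-moment contrast singles out a direction close to the heavy-tailed axis. In higher dimensions a grid is infeasible, which is one source of the computational expense noted above.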


14.4.5 Kernel Projection Pursuit 


With slight abuse of notation, $F$ is used below to denote both the set of possible
weight vectors $w$, and the set of functions $f$ that satisfy a corresponding constraint
on $f$ (e.g., $f(x) = \langle w, \Phi(x) \rangle$ where $\|w\| \le 1$).

We are now ready to state a general feature extraction framework that contains
PCA, Projection Pursuit, and Kernel PCA as special cases, by combining the
modifications of Sections 14.4.3 and 14.4.4. We obtain

$f^1 = \operatorname{argmax}_{f \in F} \frac{1}{m} \sum_{i=1}^{m} q(f(x_i))$, (14.32)

or more generally,

$f^1 = \operatorname{argmax}_{f \in F} Q(f(x_1), \ldots, f(x_m))$, (14.33)

where $q(\cdot)$ and $Q(\cdot)$ are functions which are maximized for a given property of the
resulting function $f(x)$, and

$F := \{ f \mid f : \mathcal{X} \to \mathbb{R} \text{ and } \Omega[f] \le 1 \}$. (14.34)

Note that if $F$ is the class of linear functions in input space (that is, if
$F = \{ f \mid f(x) = \langle w, x \rangle \text{ and } \|w\|^2 \le 1 \}$, where the projections are restricted to the unit ball), we
recover Projection Pursuit.


14.4.6 Connections to Supervised Learning 


The setting described above bears some resemblance to the problem of minimizing
a regularized risk functional in supervised learning (Chapter 4). In the latter case,
we try to minimize a function (the empirical risk $R_{emp}[f]$) that depends on the
observed data, with the constraint $\Omega[f] \le c$. In the feature extraction setting, we
try to maximize a function $Q[f] = Q(f(x_1), \ldots, f(x_m))$ under the same constraint.

    Risk Minimization                  Feature Extraction
    minimize    $R_{emp}[f]$           maximize    $Q[f]$              (14.35)
    subject to  $\Omega[f] \le c$      subject to  $\Omega[f] \le c$

This means that many of the theoretical guarantees from supervised learning, such
as bounds on the difference between $Q[f]$ and the expectation $E[Q[f]]$, can be
obtained directly from their analogues in classification and regression.

A cautionary remark is necessary, however: since the class of possible feature
extractors is now significantly larger than in projection pursuit, we have to be
very careful not to pick a feature extractor $f^i$ that renders any dataset “interesting”
if viewed as $\{f^i(x_1), \ldots, f^i(x_m)\}$. This means that not all $Q$ should be used, and
in particular not the scale invariant versions of $Q$, since the latter render the
constraint $\Omega[f] \le c$ ineffective.

Finally, if $Q$ is not convex, the maximum search over the extreme points does
not provide us with the best solution. Although we may still apply the algorithm,
we lose some of its good theoretical properties.

We now return to sparse KFA; that is, the feature extraction algorithm that maximizes
variance subject to an $\ell_1$ constraint (Section 14.4.3). We focus on how to actually
solve the optimization problem. Despite the superficial similarity between the
two settings in (14.35), the resulting algorithms for supervised and unsupervised 
learning are quite different. This is due to the fact that one problem (the supervised 
one) is usually a convex minimization problem, whereas the unsupervised problem 
requires convex maximization. 


14.5.1 Solution by Maximum Search 


Recalling Theorem 6.10, the feasibility of the convex maximization problem depends
largely on the extreme points of the set $F$, as defined by (14.34). In other
words, the optimization problem can be solved efficiently only if the set of extreme
points of $F$ is small and can be computed easily. Otherwise, only a brute
force search over the extreme points of $F$ (which can be NP hard) yields the maximum.
This effectively limits the choice of practically useful constraint sets $F$ to $F_{LP}$
and $F_p$ (where $p \le 1$). The extreme points in both sets coincide, and equal

Extreme Points $(F_{LP}) = \{ \pm k(x_i, \cdot) \mid i \in [m] \}$. (14.36)

Thus, we obtain the following corollary of Theorem 6.10.


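Corollary 14.3 (Vertex Solutions for Kernel Feature Analysis) If the functions $f$
and $-f$ generally yield the same $Q$ value, and $p \le 1$, we have

$f^1 = \operatorname{argmax}_{f \in F_{LP}} Q(f(x_1), \ldots, f(x_m))$ (14.37)
$\phantom{f^1} = \operatorname{argmax}_{f \in F_p} Q(f(x_1), \ldots, f(x_m))$ (14.38)
$\phantom{f^1} = \operatorname{argmax}_{f \in \{k(x_1, \cdot), \ldots, k(x_m, \cdot)\}} Q(f(x_1), \ldots, f(x_m))$. (14.39)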


Under the above symmetry assumption, we can limit ourselves to analyzing the 
positive orthant only. See Figure 14.9 for a pictorial representation of the shapes of 
unit balls corresponding to different norms. 

Eq. (14.37) provides us with a simple algorithm to solve the feature extraction
problems introduced in Sections 14.4.3 and 14.4.5: simply seek the kernel function
$k(x_i, \cdot)$ with the largest value of $Q(k(x_i, x_1), \ldots, k(x_i, x_m))$.
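
The maximum search can be sketched in a few lines for the variance criterion. The code assumes NumPy; the Gaussian kernel width, the three-cluster toy data, and the function names are illustrative choices rather than the settings used in the text. The final lines add a sanity check of the vertex property: random coefficient vectors with unit $\ell_1$ norm never yield a larger variance than the best single kernel function.

```python
import numpy as np

def gaussian_kernel(X, Y, width=0.1):
    """k(x, x') = exp(-||x - x'||^2 / width); the width is an arbitrary choice here."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / width)

def first_kfa_feature(K):
    """Index i maximizing the empirical variance of k(x_i, x_1), ..., k(x_i, x_m)."""
    q_values = K.var(axis=1)        # row i holds the candidate f = k(x_i, .) on the sample
    return int(np.argmax(q_values)), q_values

rng = np.random.default_rng(0)
centers = np.array([[-0.5, -0.2], [0.0, 0.6], [0.5, -0.2]])
X = np.concatenate([c + 0.1 * rng.normal(size=(30, 2)) for c in centers])
K = gaussian_kernel(X, X)
best, q_values = first_kfa_feature(K)

# Random points on the boundary of the l1 ball, for comparison with the vertices
alphas = rng.normal(size=(500, len(X)))
alphas /= np.abs(alphas).sum(axis=1, keepdims=True)
random_q = (alphas @ K).var(axis=1)

print(best, q_values[best], random_q.max())
```

Once the kernel matrix is available, the search is a single pass over its rows; no eigendecomposition is needed.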


14.5.2 Sequential Decompositions 
We now address how to proceed once the first direction of interest (or function) has
been found. In the following, we denote by $F^i$ the space of directions to select from
in the $i$th round; in particular, $F^1 := F$. To keep matters simple, we limit ourselves
to the dot product representation of the basis functions, $f(x) = \langle v, \Phi(x) \rangle$. The
vectors spanning this space are denoted by $\tilde{v}_j^i$. We start with $\tilde{v}_j^1 = \Phi(x_j)$. Unless
stated otherwise, we focus on $F_{LP}$, for which a 1-norm is taken on the coefficients.⁶
Hence we have

$F^i := \left\{ w \,\middle|\, w = \sum_{j=1}^{m} \alpha_j \tilde{v}_j^i \text{ with } \sum_{j=1}^{m} |\alpha_j| \le 1 \right\}$. (14.40)

Figure 14.9 Several unit balls in $\mathbb{R}^2$. From left to right: $\ell_{0.5}$, $\ell_1$, $\ell_2$, and $\ell_\infty$. Note that the $\ell_{0.5}$
and $\ell_1$ balls have identical extreme points.

We discuss the following three possible choices:


Removal: We might, for instance, simply remove the corresponding vector $\tilde{v}_j^i = \Phi(x_j)$
from the set of possible directions (and keep all other directions unchanged),
so that $\tilde{v}_j^{i+1} = \tilde{v}_j^i$, and repeat the proposed procedure to find the next vector $v^{i+1}$.
This may lead to many very similar ‘principal’ directions (each of which might
be interesting on its own), however, which implies that many such directions add
little additional information (cf. Figure 14.10). This is definitely not desirable, even
though the computational cost for subsequent calculations is very low (a simple
sorting operation), once all $Q$ values are computed.


Unnormalized Projection: A second alternative is to require that each direction $v^i$
be orthogonal to all previous directions, $\langle v^i, v^j \rangle = 0$ for all $i > j$. The easiest way to
achieve this is by orthogonalizing all the vectors spanning $F^{i+1}$ with respect to $v^i$,

$\tilde{v}_j^{i+1} := \tilde{v}_j^i - \frac{\langle v^i, \tilde{v}_j^i \rangle}{\|v^i\|^2} v^i$.

As in the previous case, this approach also reduces the set of vectors spanning $F^{i+1}$
by one, since $v^i$ is chosen among the $\tilde{v}_j^i$. Computation of the first $p$ principal features
involves only $p$ kernel functions. This is because $\tilde{v}_j^i$ is a linear combination of
$i$ images of patterns, of which $i - 1$ are already used in the computation of the
previous $i - 1$ features.

Since $\tilde{v}_j^i$ is not necessarily in $F^1$ (the sum of the expansion coefficients $\alpha$ is no
longer contained inside the $\ell_1$ unit ball), we call this approach Unnormalized Kernel
Feature Analysis.

6. Recall that $p$-norms with $0 < p \le 1$ lead to identical solutions on the corresponding set
$F_p$.

Figure 14.10 Contour plots of the first 4 feature extractors of sparse Kernel Feature Analysis,
given that patterns are removed after being selected as interesting directions of projection.
We used Gaussian RBF Kernels ($\sigma^2 = 0.05$) for a dataset of 120 samples. The small dots
represent the data points. Note that the first three feature extractors are almost identical.
Panel titles, from left to right: Q-Value = 0.408, 0.408, 0.407, 0.403.


Normalized Projection: We can obtain another version of the algorithm by normalizing
the expansion coefficients of $\tilde{v}_j^i$ to have unit $p$-norm. Since we do not use
this version presently, we refer to [508] for details.


Henceforth, we consider only unnormalized projection. We thus obtain an algo- 
rithm for sparse KFA which requires just O(p - m) operations for the computation 
of p features for new data. For further detail, see [508]. 

Extraction of the principal directions themselves, however, is an $O(m^2)$ operation
per feature extractor, as in the case of kernel PCA.⁷ This cost arises since
finding the direction with maximum Q value still requires computation of all dot
products between all possible directions of projection and the actual patterns to be
analyzed.
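
The greedy selection with unnormalized projection can be sketched directly on the Gram matrix. The following is a minimal illustration rather than the implementation of [508]: it assumes that the Q value of a candidate direction is the empirical second moment of the projections of all patterns onto it, and that the candidates are the (deflated) mapped patterns. The function names and the kernel width convention are our own choices.

```python
import numpy as np

def rbf_kernel(X, sigma2):
    """Gaussian RBF Gram matrix exp(-||x - x'||^2 / (2 sigma2)) (assumed convention)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma2))

def sparse_kfa(K, p):
    """Greedy sparse KFA with unnormalized projection, written on the Gram matrix.

    R[k, j] is the dot product between pattern k and the deflated candidate
    direction j; it starts as K and loses one direction per extracted feature.
    """
    R = np.array(K, dtype=float)
    available = np.ones(R.shape[0], dtype=bool)
    chosen, q_values = [], []
    for _ in range(p):
        q = np.mean(R ** 2, axis=0)            # assumed contrast: second moment of projections
        q = np.where(available, q, -np.inf)    # never pick the same pattern twice
        j = int(np.argmax(q))
        chosen.append(j)
        q_values.append(float(q[j]))
        available[j] = False
        # Gram-Schmidt step in feature space: remove the component along direction j.
        R -= np.outer(R[:, j], R[j, :]) / R[j, j]
    return chosen, q_values

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
K = rbf_kernel(X, sigma2=0.5)
chosen, q_values = sparse_kfa(K, p=5)
print(chosen, q_values)
```

Each pass evaluates all m candidate columns and updates an m × m matrix, which is the $O(m^2)$ cost per feature extractor discussed above; the returned index list is the sparse expansion, one new kernel function per feature.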


14.5.3 A Probabilistic Speedup 


It is wasteful to compute all possible Q values for all directions, given that we 
only choose one of these directions. This suggests that we should terminate before 
completion the calculation of those directions that do not seem to be promising in 
the first place. When doing this, however, we must ensure a low likelihood that 
important directions are lost. In [508], Corollary 6.34 is used to derive probabilistic 
bounds on the error incurred. This leads to a method for approximating calculations
by only summing over half of the terms. Applying the method in a divide
and conquer fashion (similar to the Fast Fourier Transform [110]), we end up with


7. The constants, however, are significantly smaller than when computing eigenvectors of
a matrix (the latter requires several passes over the matrix $K_{ij}$).
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a computational cost of O(m log m) for computing a single function; this represents 
a significant improvement over the $O(m^2)$ cost of Kernel PCA.

In the next section, we describe a different way of speeding up the algorithm.
Rather than approximating the sums by computing only a subset of the terms, 
we instead compute only some of the sums. 


14.5.4 A Quantile Trick 


Rather than attempting to find the best n feature extractors, we may be content
with feature extractors that are among the best obtainable. This is a reasonable
simplification, given that $v^i$ and related quantities are themselves obtained by
approximation.

For instance, it might be sufficient for preprocessing purposes if each feature
were among the best 5% obtainable. This leads to another approach for avoiding
a search over all m possible directions: compute a subsample of $\tilde{m}$ directions, and
choose the largest Q-value among them. We can show (see Corollary 6.32) that
such a sub-sampling approach leads on average to values in the $\frac{\tilde{m}}{\tilde{m}+1}$ quantile
range. Moreover, Theorem 6.33 shows that a subset of size 59 is already sufficient
to obtain results in the 95% quantile range with 95% probability.
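
The figure of 59 can be reproduced with a short calculation. The sketch below assumes independent draws from the candidate directions; the function names are ours and only illustrate the counting argument, not the proofs in Chapter 6.

```python
import math

def subset_size(quantile, confidence):
    """Smallest n such that the best of n random draws lies above the given
    quantile with probability at least `confidence`."""
    # All n draws fall below the quantile with probability quantile**n,
    # so we need quantile**n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(quantile))

def expected_quantile(n):
    """Expected quantile of the maximum of n independent draws: n / (n + 1)."""
    return n / (n + 1)

print(subset_size(0.95, 0.95))   # 59
print(expected_quantile(59))
```

The subset size depends only on the desired quantile and confidence, not on m, which is why the cost per feature extractor becomes linear in m.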

Overall, computational complexity for the extraction of a single feature is reduced
to $O(cm)$ rather than $O(m^2)$. The same applies to memory requirements,
since we no longer have to compute the whole matrix K beforehand. Thus unless 
the best feature extractors are needed, this should be the method of choice. 


14.5.5 Theoretical Analysis 


Due to lack of space, we do not give a statistical analysis of the algorithm. Suffice 
to say that [508] contains a brief analysis in terms of capacity concepts, such as 
covering numbers (cf. Chapters 5 and 12). The basic idea is that due to the use of 
regularizers such as the $\|w\|^2$ term, uniform convergence bounds can be given on
the reliability of the feature extractors; in other words, bounds can be derived on 
how much the variance of (say) a feature extractor differs between training and 
test sets. 


14.6 KFA Experiments 


Let us again consider the toy example of three artificial Gaussian data clusters, 
which we used in Figure 14.4 in the case of Kernel PCA. The randomized quantile 
version of KFA, shown in Figure 14.11, leads to rather similar feature extractors, 
although it is significantly faster to compute. 

The main difference lies in the first few features. For instance, KFA uses only one 
basis function for the first feature (due to the built-in sparsity), which enforces a 
feature extractor that resides on one of the three clusters. Kernel PCA, on the other 
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Q-Value=0.373 Q-Value=0.394 


Figure 14.11 The first p = 16 features of sparse Kernel Feature Analysis, using the same 
kernel as in Figure 14.4. Note that every additional feature only needs one more kernel 
function to be computed. We used the randomized version of the algorithm, for which only 
a subset of 10 features per iteration is used (leading to an average quantile of over 95%). 
Note the similarity to Figure 14.4, which is an $O(m^2)$ rather than $O(pm)$ algorithm (per
feature extractor). 


hand, already has contributions from all basis functions for the first feature. 

In all cases, it can be seen that the features are meaningful, in that they reveal 
nontrivial structure. The first features identify the cluster structure in the data set, 
while the higher order features analyze the individual clusters in more detail. 

To see the effect of Sparse KFA on real data, we carried out a small experiment 
on the MNIST dataset of handwritten digits (Figure 14.12). We observe that almost 
all digits appear among the first 10 basis kernels, and that the various copies of 
digit ‘1’ do not overlap much and are therefore approximately orthogonal when 
compared with the Gaussian RBF kernel. 


14.7 Summary


This chapter introduced the kernel generalization of the classical PCA algorithm. 
Known as Kernel PCA, it represents an elegant way of performing PCA in high 
dimensional feature spaces and getting rather good results in finite time (via a
simple matrix diagonalization).

Figure 14.12 The 15 samples corresponding to the first features extracted by Sparse
Kernel Feature Analysis from the MNIST database of handwritten digits. Note how the
algorithm picks almost all digits at least once among the first 10.


We pointed out some parallels to SVMs, in terms of the regularizer that is 
effectively being used, and we reported experimental results in nonlinear feature 
extraction applications. 

Linear PCA is used in numerous technical and scientific applications, including 
noise reduction, density estimation, and image indexing and retrieval systems. 
Kernel PCA can be applied to all domains where traditional PCA has so far been 
used for feature extraction, and where a nonlinear extension would make sense. 

There are, however, some computational issues, which make it desirable to think 
of alternatives to Kernel PCA that can be applied in situations where, for instance, 
the sample size is too large for the kernel matrix diagonalization to be feasible. 
Motivated by this, we described KFA, a modification which utilizes a sparsity 
regularizer. The solution of KFA can be found on the set of extreme points of 
the constraints, provided the contrast function itself is convex. In particular, if the 
constraints form a polyhedron, the extreme points can be found on the vertices. 
This reduces a potentially complex optimization problem to a maximum search 
over a finite set of size m. Randomized subset selection methods help to speed 
up the algorithm to linear cost and constant memory requirement per feature 
extractor. 

We explained how both algorithms, along with classical approaches such as pro- 
jection pursuit, can be understood as special cases of a general feature extraction 
framework, where we maximize a contrast function under a capacity constraint. 
This may be a sparsity constraint, a feature space vector length constraint, or some 
other restriction, such as the size of the derivatives. 


14.8 Problems 


14.1 (Positive Definiteness of the Covariance Matrix •) Prove that the covariance
matrix (14.1) is positive definite, by verifying the conditions of Definition 2.4. This implies
that all its eigenvalues are nonnegative (Problem 2.4).


14.2 (Toy Examples •) Download the Kernel PCA Matlab code from
http://www.kernel-machines.org. Run it on two toy datasets which are related to each other by a translation
in input space. Why are the results identical? 


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14.3 (Pre-Image Problem e) Unlike PCA in input space, Kernel PCA only allows the 
computation of the feature values, but not explicitly the eigenvectors themselves. Discuss 
the reason for this difference, and the implied differences in the applicability of the tech- 
niques. 


14.4 (Null Space of Kernel PCA ••) How many eigenvectors with eigenvalue 0 can
Kernel PCA have in H? Discuss the difference with respect to PCA in input space.


14.5 (Centering [480] ••) Derive the equations for Kernel PCA with data that does not
have zero mean in H.
Hints: Given any $\Phi$ and any set of observations $x_1, \dots, x_m$, the points

$\tilde{\Phi}(x_i) := \Phi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j)$   (14.42)

are centered.


1. Expand the eigenvectors in terms of the $\Phi(x_i)$, and derive the modified eigenvalue
problem in the space of the expansion coefficients.


2. Derive the normalization condition for the coefficients to ensure the eigenvectors have 
unit norm. 


3. For a set of test points $t_1, \dots, t_n \in \mathcal{X}$, derive a matrix equation to evaluate the n feature
values corresponding to the kth centered principal component.
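
A derivation of this kind can be checked numerically. The sketch below is only a sanity check under stated assumptions: it uses an explicit, hypothetical feature map (the degree-2 monomials in two dimensions) so that centering in feature space as in (14.42) can be compared with the standard centering of the Gram matrix, $(1 - 1_m) K (1 - 1_m)$, where $1_m$ denotes the m × m matrix with all entries 1/m.

```python
import numpy as np

def center_gram(K):
    """Gram matrix of the centered points: (1 - 1_m) K (1 - 1_m), 1_m having entries 1/m."""
    m = K.shape[0]
    J = np.eye(m) - np.full((m, m), 1.0 / m)
    return J @ K @ J

def phi(X):
    """Explicit feature map of the kernel k(x, x') = <x, x'>^2 in two dimensions."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
F = phi(X)
K = F @ F.T                        # uncentered Gram matrix
F_centered = F - F.mean(axis=0)    # (14.42) applied in feature space
K_centered = center_gram(K)
print(np.allclose(K_centered, F_centered @ F_centered.T))
```

For kernels without an explicit feature map, only `center_gram` is needed; the explicit map serves purely to verify the identity.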


14.6 (Expansion of KPCA Solutions ••) Argue that each solution of the eigenvalue
problem for centered data could also be expanded in terms of the original mapped patterns. 
Derive the corresponding dual eigenvalue problem. How does it compare to the other one? 


14.7 (Explicit PCA in Feature Space •) Consider an algorithm for nonlinear PCA
which would explicitly map all data points into a feature space H via a nonlinear map
$\Phi$, such as the mapping induced by a kernel. Discuss under which conditions on the
feature space this would be preferable to the kernel approach. Argue that Kernel PCA always
effectively works in a finite dimensional subspace of H, even when the dimensionality of
H is infinite.


14.8 (The Kernel PCA Feature Map ••) Suppose that $\lambda_i > 0$ for all i. Prove that the
feature map (2.59) satisfies

$\langle \Phi_m^w(x), \Phi_m^w(x') \rangle = \langle \Phi_{KPCA}(x), \Phi_{KPCA}(x') \rangle$   (14.43)

for all $x, x' \in \mathcal{X}$, where (cf. (14.16))

$\Phi_{KPCA} : \mathcal{X} \to \mathbb{R}^m$,
$x \mapsto \left( \sum_{n=1}^{m} \alpha_n^i k(x_n, x) \right)_{i=1,\dots,m}$.   (14.44)

Hint: note that if $K = U D U^\top$ is K's diagonalization, with the columns of U being the
eigenvectors of K, then $K^{-1/2} = U D^{-1/2} U^\top$. Use this to rewrite (2.59). Argue that as
in (11.12), the leading U can be dropped, since U is unitary. The entries of the diagonal
matrix $D^{-1/2}$ equal $\lambda_i^{-1/2}$, thus performing the kernel PCA normalization (14.14).

Next, argue that more generally, if $\lambda_i > 0$ is not the case for all i, the above construction
leads to a p-dimensional feature map (as in Section 14.2.1, we assume that the first p
eigenvalues are the nonzero ones)

$\Phi_{KPCA}(x) = \left( \sum_{n=1}^{m} \alpha_n^i k(x_n, x) \right)_{i=1,\dots,p}$   (14.45)

satisfying

$\langle \Phi_{KPCA}(x_i), \Phi_{KPCA}(x_j) \rangle = k(x_i, x_j)$   (14.46)

for all $i, j \in [m]$.

Finally, argue that the last equation may be approximately satisfied by a feature map
which is even lower dimensional, by discarding all eigenvalues which are smaller than
some $\varepsilon > 0$.
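
Before attempting the proof, the claim (14.46) can be observed numerically. The following sketch assumes a Gaussian kernel on random data (so that all eigenvalues are positive) and coefficient vectors normalized such that $\lambda_i \langle \alpha^i, \alpha^i \rangle = 1$; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 3))
sq = np.sum(X ** 2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-d2)                      # Gaussian kernel matrix (toy choice of width)

lam, U = np.linalg.eigh(K)           # K = U diag(lam) U^T
alpha = U / np.sqrt(lam)             # column i is alpha^i, with lam_i <alpha^i, alpha^i> = 1

features = K @ alpha                 # row j holds the feature values of x_j
print(np.allclose(features @ features.T, K))

# Discarding small eigenvalues gives a lower dimensional, approximate map.
keep = lam > 1e-2
approx = features[:, keep] @ features[:, keep].T
print(np.abs(approx - K).max())
```

The second part illustrates the final step of the problem: the error made by dropping components is controlled by the discarded eigenvalues.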


14.9 (VC Bounds for Kernel PCA ○○○) Construct a VC theory of Kernel PCA; in other
words, give bounds on the variance of a Kernel PCA feature extractor on the test set in 
terms of the variance on the training set, the size of the corresponding eigenvalue, and the 
covering numbers of the kernel-induced function class (see Section 14.4.6). 


14.10 (Connection KPCA–SVM ••) From the known properties of PCA (cf. Proposition 14.1),
prove Proposition 14.2.


14.11 (Transformation Invariances ••) Consider a transformation $\mathcal{L}_t$, parametrized by
t, such as translation along the x-axis. To first order, the effect of a small transformation
(small t) can be studied by considering the tangent vectors $\Phi(\mathcal{L}_t x_i) - \Phi(x_i)$. Mathematically
derive invariant feature extractors by performing PCA on the covariance matrix of the
tangent vectors (the tangent covariance matrix). Note the following problem: invariant
feature extractors should have small eigenvalues, but eigenvectors with eigenvalue 0 do
not necessarily lie in the span of the mapped examples (cf. (14.3)).


14.12 (Transformation Invariances, Part II •••) Extend the previous approach by
simultaneously aiming for invariance under $\mathcal{L}_t$ and for variance in the original Kernel PCA
directions (cf. [364]).

Hint: formulate a problem of simultaneous diagonalization. 


14.13 (Singularity of the Centered Covariance Matrix •) Prove that if $\alpha^k$ is an
eigenvector, with nonzero eigenvalue, of the centered covariance matrix, then $\sum_i \alpha_i^k = 0$.
Why does this imply that the centered covariance matrix is singular?


14.14 (Primal and Dual Eigenvalue Problems [480] ••) Prove that (14.12) yields all
solutions of (14.7).

Hint: show that any solution of (14.11) which does not solve (14.12) differs from a
solution of (14.12) only by a vector $\alpha$ with the property $\sum_i \alpha_i \Phi(x_i) = 0$.


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14.15 (Multi-Layer Support Vector Machines e) By first extracting nonlinear princi- 
pal components according to (14.16), and then training a Support Vector Machine, we can 
construct Support Vector type machines with additional layers. Discuss the architecture, 
and the different ways of training the different layers. 


14.16 (Mechanical Analogy ○○○) Try to generalize the mechanical PCA algorithm
described in [443], which interprets PCA as an iterative spring energy minimization
procedure, to a feature space setting. Try to come up with mechanically inspired ways of taking
into account negative data in PCA (cf. oriented PCA, [140]).


14.17 (Kernel PCA and Locally Linear Embedding ••) Suppose we approximately
represent each point of the dataset as a linear combination of its n nearest neighbors. Let
$(W_n)_{ij}$, where $i, j \in [m]$, be the weight of point $x_j$ in the expansion of $x_i$ minimizing the
squared representation error.


1. Prove that $k_n(x_i, x_j) := \left( (1 - W_n)^\top (1 - W_n) \right)_{ij}$ is a positive definite kernel on the
domain $\mathcal{X} = \{x_1, \dots, x_m\}$.

2. Let $\lambda$ be the largest eigenvalue of $(1 - W_n)^\top (1 - W_n)$. Prove that the LLE kernel
$k_n^{LLE}(x_i, x_j) := \left( (\lambda - 1) 1 + W_n^\top + W_n - W_n^\top W_n \right)_{ij}$ is positive definite on $\{x_1, \dots, x_m\}$.

3. Prove that kernel PCA using the LLE kernel provides the LLE embedding coefficients
[445] for a d-dimensional embedding as the first d coefficient eigenvectors $\alpha^1, \dots, \alpha^d$.
Note that if the eigenvectors are normalized in H, then dimension i will be scaled by
$\lambda_i^{-1/2}$, $i = 1, \dots, d$.


4. Discuss the variant of LLE obtained using the centered Gram matrix

$(1 - 1_m) \left( (\lambda - 1) 1 + W_n^\top + W_n - W_n^\top W_n \right) (1 - 1_m)$   (14.47)

(cf. (14.17)). Which space does the centering apply to?


5. Interpret the LLE kernel as a similarity measure based on the similarity of the coefficients
required to represent two patterns in terms of n neighboring patterns.
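
The positive definiteness claimed in part 2 can be explored numerically before proving it. The sketch below is an illustration under our own assumptions: the weights are computed by the usual constrained least squares fit with a small ridge term (a common practical choice, not taken from the text), and the function names are hypothetical.

```python
import numpy as np

def lle_weights(X, n_neighbors, reg=1e-3):
    """Row i holds the weights reconstructing x_i from its nearest neighbors (rows sum to 1)."""
    m = X.shape[0]
    W = np.zeros((m, m))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    for i in range(m):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]        # skip the point itself
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(n_neighbors)        # ridge term (assumption)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()
    return W

def lle_kernel(W):
    """lambda * 1 - (1 - W)^T (1 - W), i.e. (lambda - 1) 1 + W^T + W - W^T W."""
    m = W.shape[0]
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    lam = np.linalg.eigvalsh(M).max()
    return lam * np.eye(m) - M

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
W = lle_weights(X, n_neighbors=5)
K_lle = lle_kernel(W)
print(np.linalg.eigvalsh(K_lle).min())
```

The smallest eigenvalue of `K_lle` is zero up to rounding, which reflects that the shift by $\lambda$ is the smallest one making the matrix positive semidefinite.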


14.18 (Optimal Approximation Property of PCA •) Discuss whether the solutions of
KFA satisfy the optimal approximation property of Proposition 14.1. 


14.19 (Scale Invariance ••) Show that the problems of Kernel PCA and Sparse Kernel
Feature Analysis are scale invariant, meaning that the solutions for $\Omega[f] \le c$ and
$\Omega[f] \le c'$ for $c, c' > 0$ are identical up to a scaling factor.

Show that this also applies for a rescaling of the data in feature space. What happens
if we rescale in input space? Analyze specific kernels such as $k(x, x') = \langle x, x' \rangle^d$ and
$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$.

14.20 (Contrast Functions for Projection Pursuit •••) Compute for $q(\xi) = \xi^4$ the
expectations under a normal distribution of unit variance. What happens if you use a
different distribution with the same variance?
Can you find an optimal function $q(\xi)$ provided that we are only dealing with densities of
zero mean and unit variance? Hint: use a Lagrangian and variational derivatives; in other
words, set up a constrained optimization problem as in (6.39), but with integrals rather
than sums in the individual constraints (see [206] for details on variational derivatives).
You need three constraints: $\int p(\xi)\,d\xi = 1$, $\int \xi\, p(\xi)\,d\xi = 0$, and $\int \xi^2 p(\xi)\,d\xi = 1$.


14.21 (Cutting Planes and $F_1$ •••) Compute the vertices of the polyhedral set obtained
by orthogonally cutting $F_1$ with one of the vectors $\Phi(x_i)$. Can you still compute them if we
replace $F_1$ by $F_p$?

Show that the number of $\Phi(x_i)$ required per vertex may double per cut (until it involves
all m of the $\Phi(x_i)$).


14.22 (Pre-Image Problem ••) Devise a denoising algorithm for Sparse Kernel Feature
Analysis, using the methods of Chapter 18. 


14.23 (Comparison Between Kernel PCA and Sparse KFA •) Plot the variances obtained
from the sets of the Kernel PCA and Sparse KFA projections. Discuss similarities
and differences. Why do the variances decay more slowly (with the index of the projection)
for Sparse KFA?


14.24 (Extension to General Kernels ○○○) Can you extend the sparse feature extraction
algorithm to kernels which are not positive definite? Hint: begin with a modification
of V1. Which criterion replaces orthogonality in feature space (e.g., $\ell_2^m$ on $\mathcal{X}$)? Does the
algorithm retain its favorable numerical properties (such as cheap diagonalization)? What
happens if you use arbitrary functions $f_i$ and $\langle \Upsilon f_i, \Upsilon f_j \rangle$ as the corresponding dot product?


14.25 (Uniform Convergence Bounds •••) Prove a bound for the deviation between
the expected value of Q[f] and its empirical estimate, $P\{|E[Q[f]] - Q[f]|\}$. Hint: use
uniform convergence bounds for regression.
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In the previous chapter, we reported experiments in which kernel PCA feature ex- 
traction was applied to solve classification problems. This was done by following a 
two-step approach: first, extract the features, irrespective of the classification task; 
second, train a simple linear discriminative classifier on the features. 

It is possible to combine both steps by constructing a so-called Kernel Fisher 
Discriminant (KFD) (e.g. [363, 364, 442, 27], cf. also [490]). The idea is to solve the 
problem of Fisher’s linear discriminant [171, 186] in a feature space H, thereby 
yielding a nonlinear discriminant in the input space. 

The chapter is organized as follows. After a short introduction of the standard 
Fisher discriminant, we review its kernelized version, the KFD algorithm (Sec- 
tion 15.2). In Section 15.3, we describe an efficient implementation using sparse 
approximation techniques. Following this, we give details on how the outputs of 
the KFD algorithm can be converted into conditional probabilities of class mem- 
bership (Section 15.4). We conclude with some experiments. 

Most of the chapter only requires knowledge of the kernel trick, as described
in Chapter 1 and, in more detail, Chapter 2. To understand the connection to
SVMs, it would be helpful to have read Chapter 7. The details of the training 
procedure described in Section 15.3 are relatively self-contained, but are easier to 
understand after reading the background material in Section 6.5, and (optionally) 
Section 16.4. Finally, Section 15.4 requires some basic knowledge of Bayesian 
methods, as provided for instance in Section 16.1. 


15.1 Introduction 


Rayleigh 
Coefficient 


Let us start by giving a concise summary of the Fisher discriminant algorithm, 
following the treatment of [375]. For further detail, see [363]. 

In the linear case, Fisher’s discriminant is computed by maximizing the so-called
Rayleigh coefficient with respect to w,


$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$,   (15.1)


depending on the between- and within-class variances,


$S_B = (m_- - m_+)(m_- - m_+)^\top$   (15.2)
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WPCA Figure 15.1 Illustration 
of the main projections of PCA and Fisher’s Discriminant for a toy data set. PCA does not
consider the labels (indicated by solid and open symbols, respectively) and simply returns the
direction of overall maximum variance as the first eigenvector. Fisher’s discriminant, on the
other hand, returns a projection that yields a much better separation between the two classes
(from [375]).


W Fisher 


and 


$S_W = \sum_{q=\pm} \sum_{i \in J_q} (x_i - m_q)(x_i - m_q)^\top$.   (15.3)


Here, $m_q$ and $J_q$ denote the sample mean and the index set for class q, respectively.
The idea is to look for a direction such that when the patterns are projected onto 
it, the class centers are far apart while the spread within each class is small — this 
should cause the overlap of the two classes to be small. 

Figure 15.1 gives a sketch of the projection w found by Fisher’s Discriminant. 
Unlike PCA, this projection takes the class labels into account. We can show that 
the Fisher discriminant finds the optimal discriminating direction between the 
classes (in the sense of having minimal expected misclassification error), subject 
to the assumption that the class distributions are (identical) Gaussians. 
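
For two classes, the maximizer of the Rayleigh coefficient has the well-known closed form $w \propto S_W^{-1}(m_- - m_+)$. The sketch below illustrates this on toy data in the spirit of Figure 15.1; the data, the function names, and the comparison with the leading PCA direction are our own illustrative choices.

```python
import numpy as np

def scatter(X):
    """Sum of outer products of the centered rows of X."""
    Z = X - X.mean(axis=0)
    return Z.T @ Z

def fisher_direction(X_pos, X_neg):
    """Maximizer of the Rayleigh coefficient, w proportional to S_W^{-1} (m_- - m_+)."""
    S_W = scatter(X_pos) + scatter(X_neg)
    w = np.linalg.solve(S_W, X_neg.mean(axis=0) - X_pos.mean(axis=0))
    return w / np.linalg.norm(w)

def rayleigh(w, X_pos, X_neg):
    """J(w) = w^T S_B w / w^T S_W w, using w^T S_B w = (w^T (m_- - m_+))^2."""
    diff = X_neg.mean(axis=0) - X_pos.mean(axis=0)
    S_W = scatter(X_pos) + scatter(X_neg)
    return (w @ diff) ** 2 / (w @ S_W @ w)

rng = np.random.default_rng(0)
cov = np.array([[4.0, 0.0], [0.0, 0.25]])            # elongated classes (toy data)
X_pos = rng.multivariate_normal([0.0, 1.0], cov, size=200)
X_neg = rng.multivariate_normal([0.0, -1.0], cov, size=200)

w_fisher = fisher_direction(X_pos, X_neg)
w_pca = np.linalg.eigh(scatter(np.vstack([X_pos, X_neg])))[1][:, -1]   # leading PCA direction
print(rayleigh(w_fisher, X_pos, X_neg), rayleigh(w_pca, X_pos, X_neg))
```

On such data the PCA direction follows the long axis of the clusters, whereas the Fisher direction is pulled towards the axis separating the class means.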


15.2 Fisher’s Discriminant in Feature Space 


To formulate the problem in a feature space H, we can expand $w \in H$ as


$w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$,   (15.4)


as in the case of Kernel PCA (14.9). Below, we use the notation $1_q$ to denote the
m-dimensional vector with components $[1_q]_i$ equal to 1 if the pattern $x_i$ belongs
to class q, and 0 otherwise. Additionally, let $m_q := |J_q|$ be the class sizes, and

$\mu_q := \frac{1}{m_q} K 1_q$,   $N := K K^\top - \sum_{q=\pm} m_q \mu_q \mu_q^\top$,   (15.5)




Rayleigh 
Coefficient in H 


Eigenvalue 
Formulation 


QP Formulation 


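$\mu := \mu_- - \mu_+$, $M := \mu \mu^\top$, and $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$. A short calculation
shows that the optimization problem consists of maximizing [364]

$J(\alpha) = \frac{\alpha^\top M \alpha}{\alpha^\top N \alpha} = \frac{(\alpha^\top \mu)^2}{\alpha^\top N \alpha}$.   (15.6)

The projection of a test point onto the discriminant is computed by

$\langle w, \Phi(x) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x)$.   (15.7)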
i=1 


As the dimensionality of the feature space is usually much higher than the number 
of training samples m, it is advisable to use regularization. In [363], the addition of 
a multiple of the identity or of the kernel matrix K to N was proposed, penalizing 
$\|\alpha\|^2$ or $\|w\|^2$ respectively (see also [177, 230]).

There are several equivalent ways of maximizing (15.6). We could solve the 
generalized eigenvalue problem, 


$M\alpha = \lambda N\alpha$,   (15.8)
selecting the eigenvector $\alpha$ with maximal eigenvalue $\lambda$, or compute
$\alpha \propto N^{-1}(\mu_- - \mu_+)$ (Problem 15.2).   (15.9)
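
The direct solution (15.9), combined with the regularization by a multiple of the identity mentioned above, can be sketched as follows. This is a minimal illustration on toy data, assuming $\mu_q = \frac{1}{m_q} K 1_q$ and $N = K K^\top - \sum_q m_q \mu_q \mu_q^\top$; the function names, the Gaussian kernel width, and the value of the regularization constant are our own choices.

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0):
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma2))

def kfd_alpha(K, y, reg=1e-3):
    """alpha proportional to (N + reg * 1)^{-1} (mu_- - mu_+), cf. (15.5) and (15.9)."""
    m = K.shape[0]
    N = K @ K.T
    mu = {}
    for q in (+1, -1):
        ind = (y == q).astype(float)           # the indicator vector 1_q
        m_q = ind.sum()
        mu[q] = K @ ind / m_q
        N -= m_q * np.outer(mu[q], mu[q])
    return np.linalg.solve(N + reg * np.eye(m), mu[-1] - mu[+1])

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=40)
ring = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)]) + rng.normal(0.0, 0.1, size=(40, 2))
blob = rng.normal(0.0, 0.3, size=(40, 2))
X = np.vstack([blob, ring])                    # class +1: inner blob, class -1: ring
y = np.concatenate([np.ones(40), -np.ones(40)])

K = rbf_kernel(X, X)
alpha = kfd_alpha(K, y)
proj = K @ alpha                               # (15.7) evaluated on the training points
print(proj[y > 0].mean(), proj[y < 0].mean())
```

The two classes are not linearly separable in input space, yet their projections onto the discriminant have clearly different means; the cost is that an m × m system is solved and `alpha` is dense, which motivates the developments below.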


Although there exist many efficient off-the-shelf eigenvalue problem solvers or 
Cholesky packages which could be used to optimize (15.6), two problems remain: 
for a large sample size m, the matrices N and M become large, and the solutions 
$\alpha$ are non-sparse. One way of dealing with this issue is to transform KFD into
a convex quadratic programming problem [362]. Apart from algorithmic advan- 
tages, this formulation also allows for a more transparent view of the mathematical 
properties of KFD, and in particular its connection to SV classifiers (Chapter 7) and 
the Relevance Vector Machine ([539, 362], see Chapter 16). 

Recalling that Fisher’s Discriminant tries to minimize the variance of the data 
along the projection whilst maximizing the distance between the average outputs 
for each class, we can state the following quadratic program: 


$\underset{\alpha, b, \xi}{\text{minimize}} \quad \|\xi\|^2 + C\,\Omega(\alpha)$,   (15.10)
subject to $K\alpha + 1b = y + \xi$,
           $1_q^\top \xi = 0$ for $q = \pm$.


Here, $\alpha, \xi \in \mathbb{R}^m$, $b, C \in \mathbb{R}$, y is the vector of class labels ±1, and $\Omega(\alpha)$ is one of the
regularizers mentioned above; that is, $\Omega(\alpha) = \|\alpha\|^2$ or $\Omega(\alpha) = \alpha^\top K \alpha$. It can be
shown that this program is equivalent to (15.6). The first constraint, which can be
read as


$\langle w, \Phi(x_i) \rangle + b = y_i + \xi_i$ for all $i = 1, \dots, m$,   (15.11)


pulls the output for each sample towards its class label. The term $\|\xi\|^2$ minimizes
the variance of the error committed, while the constraints $1_q^\top \xi = 0$ ensure that
the average output for each class equals the label; in other words, for ±1 labels,
the average distance of the projections is 2. The formulation in (15.10) has the
additional benefit that it lends itself to the incorporation of more general noise 
models, which may be more robust than the Gaussian model [362]. 

Besides providing additional understanding, (15.10) allows the derivation of 
more efficient algorithms. Choosing an $\ell_1$-norm regularizer, $\Omega(\alpha) = \|\alpha\|_1$, we
obtain sparse solutions: as discussed in Chapter 4, the $\ell_1$-norm regularizer is a
reasonable approximation to an $\ell_0$ regularizer, which simply counts the number
of nonzero elements in $\alpha$.

For large datasets, solving (15.10) is not practical. It is possible, however, to
approximate the solution in a greedy way. In the next section, we iteratively approximate
the solution to (15.10) with as few non-zero $\alpha_i$ as possible, following [366].


15.3 Efficient Training of Kernel Fisher Discriminants 


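To proceed, let us rewrite (15.10), using $\Omega(\alpha) = \|\alpha\|^2$. We define


$a := \begin{bmatrix} b \\ \alpha \end{bmatrix}$,  $c := \begin{bmatrix} m_+ - m_- \\ K^\top y \end{bmatrix}$,  $A_\pm := \begin{bmatrix} m_\pm \\ K^\top 1_\pm \end{bmatrix}$,  $H := \begin{bmatrix} m & 1^\top K \\ K^\top 1 & K^\top K + C\,1 \end{bmatrix}$.   (15.12)


Here, $m_\pm$ denotes the number of samples in class ±1. Then (15.10) can be rewritten
using the equivalent


$\underset{a}{\text{minimize}} \quad \frac{1}{2} a^\top H a - c^\top a + \frac{m}{2}$   (15.13)
subject to $A_+^\top a - m_+ = 0$,   (15.14)
           $A_-^\top a + m_- = 0$.   (15.15)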


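Forming the Lagrangian of (15.13) with multipliers $\lambda_\pm$,


$L(a, \lambda_+, \lambda_-) = \frac{1}{2} a^\top H a - c^\top a + \lambda_+ (A_+^\top a - m_+) + \lambda_- (A_-^\top a + m_-) + \frac{m}{2}$,   (15.16)


and taking derivatives with respect to the primal variables a, we obtain the dual


$\underset{a, \lambda_+, \lambda_-}{\text{maximize}} \quad -\frac{1}{2} a^\top H a - \lambda_+ m_+ + \lambda_- m_- + \frac{m}{2}$   (15.17)
subject to $H a - c + (\lambda_+ A_+ + \lambda_- A_-) = 0$.   (15.18)


We now use the dual constraint (15.18) to solve for a,
$a = H^{-1} \left( c - (\lambda_+ A_+ + \lambda_- A_-) \right)$.   (15.19)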


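This equation is well defined if K has full rank (see (15.12)). If not, we can still
perform this step, as we approximate $H^{-1}$ instead of computing it directly. Resubstituting
(15.19) into the dual problem (which has no constraints left) yields the
following problem in two variables $\lambda_+$ and $\lambda_-$:


$\underset{\lambda_+, \lambda_-}{\text{maximize}} \quad -\frac{1}{2} \begin{bmatrix} \lambda_+ \\ \lambda_- \end{bmatrix}^\top \begin{bmatrix} A_+^\top H^{-1} A_+ & A_+^\top H^{-1} A_- \\ A_-^\top H^{-1} A_+ & A_-^\top H^{-1} A_- \end{bmatrix} \begin{bmatrix} \lambda_+ \\ \lambda_- \end{bmatrix}$
$\qquad + \begin{bmatrix} -m_+ + c^\top H^{-1} A_+ \\ m_- + c^\top H^{-1} A_- \end{bmatrix}^\top \begin{bmatrix} \lambda_+ \\ \lambda_- \end{bmatrix} - \frac{1}{2} c^\top H^{-1} c + \frac{m}{2}$.   (15.20)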


This problem can be solved analytically, yielding values for $\lambda_+$ and $\lambda_-$. Substituting
these into (15.19) gives values for a or, equivalently, for $\alpha$ and b.

Of course, this problem is no easier to solve, nor does it yield a sparse solution:
$H^{-1}$ is an $(m+1) \times (m+1)$ matrix, and for large datasets its inversion is not
feasible, due to the requisite time and memory costs.
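
For small m, the analytic solution can be written down directly. The sketch below is a plain dense implementation intended only to make the derivation concrete; it assumes the definitions $a = (b, \alpha)$, $c = (m_+ - m_-, K^\top y)$, $A_\pm = (m_\pm, K^\top 1_\pm)$ and H as in (15.12), and it solves the two-variable problem (15.20) by setting its gradient to zero. The function name and the toy data are ours.

```python
import numpy as np

def kfd_qp_closed_form(K, y, C):
    """Solve (15.13)-(15.15) through the two-variable dual (15.20); returns (b, alpha)."""
    m = K.shape[0]
    ones = np.ones(m)
    pos, neg = (y > 0).astype(float), (y < 0).astype(float)
    m_pos, m_neg = pos.sum(), neg.sum()

    H = np.zeros((m + 1, m + 1))
    H[0, 0] = m
    H[0, 1:] = ones @ K
    H[1:, 0] = K.T @ ones
    H[1:, 1:] = K.T @ K + C * np.eye(m)
    c = np.concatenate(([m_pos - m_neg], K.T @ y))
    A = np.column_stack((np.concatenate(([m_pos], K.T @ pos)),
                         np.concatenate(([m_neg], K.T @ neg))))

    Hinv_A = np.linalg.solve(H, A)
    Hinv_c = np.linalg.solve(H, c)
    G = A.T @ Hinv_A                                  # 2 x 2 matrix of (15.20)
    g = np.array([-m_pos, m_neg]) + A.T @ Hinv_c      # linear term of (15.20)
    lam = np.linalg.solve(G, g)                       # stationary point of the concave dual
    a = Hinv_c - Hinv_A @ lam                         # (15.19)
    return a[0], a[1:]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.7, size=(20, 2)), rng.normal(1.0, 0.7, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
sq = np.sum(X ** 2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0) / 2.0)

b, alpha = kfd_qp_closed_form(K, y, C=1.0)
outputs = K @ alpha + b
print(outputs[y > 0].mean(), outputs[y < 0].mean())
```

The class averages of the outputs coincide with the labels, as required by the constraints $1_q^\top \xi = 0$. The two linear solves with H are exactly the $(m+1) \times (m+1)$ bottleneck that the greedy scheme avoids.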

The following greedy approximation scheme can be used (cf. Chapter 6), how- 
ever. Instead of trying to find a full set of m coefficients a; for the solution (15.4), 
we approximate the solution by a shorter expansion, containing only n < m terms. 
Starting with an empty solution n = 0, we select at each iteration a new sample x; 
(or an index i), and resolve the problem for the expansion (15.4) containing this 
new index and all previously picked indices; we stop as soon as a suitable crite- 
rion is satisfied. This approach would still be infeasible in terms of computational 
cost if we had to solve the full quadratic program (15.13) anew at each iteration, or 
invert H in (15.19) and (15.20). But with the derivation given above, it is possible to 
find a close approximation to the solution at each iteration with a cost of $O(\kappa m n^2)$,
where $\kappa$ is a user defined value (see below).

Writing down the quadratic program (15.10) for KFD, where the expansion for
the solution is restricted to an n element subset $J \subset [m]$ of the training patterns,
and thus


$w_J = \sum_{i \in J} \alpha_i \Phi(x_i)$,   (15.21)

amounts to replacing the m × m matrix K by an m × n matrix $K^n$, where $K^n_{ij} =
k(x_i, x_j)$, $i = 1, \dots, m$ and $j \in J$. We can derive the formulation (15.13) in an analogous
manner using the matrix $K^n$ in (15.12). The problem is then of size n × n.
Assume we already know the solution (and inverse of H) using n kernel functions.
Then $H^{-1}$ for n + 1 samples can be obtained by a rank one update of the previous
$H^{-1}$ using only n basis functions: Eq. (10.38) tells us how. For convenience, we
repeat the statement below.

To this end, denote by $H^n$ the matrix obtained from n basis functions, and by
$H^{n+1}$ that obtained by adding one basis function to these n functions. Note that
$H^n$ and $H^{n+1}$ differ by only one row/column; we denote this difference by B, for
the n-vector, and C, for the diagonal entry $H^{n+1}_{n+1,n+1}$. We may now apply

$(H^{n+1})^{-1} = \begin{bmatrix} H^n & B \\ B^\top & C \end{bmatrix}^{-1} = \begin{bmatrix} (H^n)^{-1} + \gamma \left( (H^n)^{-1} B \right) \left( (H^n)^{-1} B \right)^\top & -\gamma (H^n)^{-1} B \\ -\gamma \left( (H^n)^{-1} B \right)^\top & \gamma \end{bmatrix}$,

where $\gamma = (C - B^\top (H^n)^{-1} B)^{-1}$. This means that we may compute the inverse of
$H^{n+1}$ by multiplying a vector with the inverse of $H^n$, and inverting a scalar.
This is an operation of cost $O(n^2)$. The last major problem is to pick an index
i at each iteration. Ideally we would choose the i for which we get the biggest
decrease in the primal objective (or equivalently the dual objective, since they
are identical for the optimal coefficients a). We would then need to compute the
update of $H^{-1}$ for all m − n indices which are unused so far, however; again, this
is too expensive. One possible solution lies in a second approximation. Instead of
choosing the best possible index, it is usually sufficient to find an index for which,
with high probability, we achieve something close to the optimal choice. It turns
out (Chapter 6) that it can be enough to consider a small subset of indices, chosen
randomly from those remaining. According to Corollary 6.32 and the discussion
following it, a random sample of size 59 is enough to obtain an estimate that is
with probability 0.95 among the best 0.05 of all estimates.


Algorithm 15.1 The Sparse Greedy Kernel Fisher Algorithm

arguments:
    Sample X = {x_1, ..., x_m}, y = {y_1, ..., y_m}
    Maximum number of coefficients
        or parameters of other stopping criterion: OPTS
    Regularization constant C
    κ and kernel k
returns:
    Set of indices I and corresponding α's.
function SG-KFD(X, y, C, κ, k, OPTS)
    Set n = 0
    I = {}
    while termination criterion not satisfied do
        Choose κ elements from [m] \ I
        for each chosen index do
            Compute column of kernel matrix
            Update inverse, compute optimal a
            Compute new dual objective
            if this objective is smaller than the ones before do
                remember this index
            endif
        end
        Update inverse H and solution a with the best index chosen
        Add this index to I
        Check termination criterion
    endwhile

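
The block inverse update used in each iteration can be checked numerically. The sketch below assumes a symmetric matrix whose leading block has a known inverse; the function name `grow_inverse` and the random test matrix are our own, and the formula is the standard partitioned inverse with $\gamma = (C - B^\top H^{-1} B)^{-1}$.

```python
import numpy as np

def grow_inverse(H_inv, B, C):
    """Inverse of [[H, B], [B^T, C]] given H_inv = H^{-1}, in O(n^2) operations."""
    v = H_inv @ B                          # one matrix-vector product
    gamma = 1.0 / (C - B @ v)              # inverse of the scalar Schur complement
    n = H_inv.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = H_inv + gamma * np.outer(v, v)
    out[:n, n] = -gamma * v
    out[n, :n] = -gamma * v
    out[n, n] = gamma
    return out

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
S = M @ M.T + 6.0 * np.eye(6)              # symmetric positive definite test matrix
H, B, C = S[:5, :5], S[:5, 5], S[5, 5]

updated = grow_inverse(np.linalg.inv(H), B, C)
print(np.allclose(updated, np.linalg.inv(S)))
```

Trying a candidate index therefore costs one matrix-vector product with the current inverse, which is what makes it affordable to test several random candidates per iteration.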
The complete algorithm for a sparse greedy solution to the KFD problem is
schematized in Algorithm 15.1. It is easy to implement using a linear algebra package
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Figure 15.2 Runtime of sparse greedy KFD training. The number of samples in the train- 
ing set versus the CPU time of the sparse greedy algorithm (dash dotted lines) and the QP 
formulation (15.10) (solid line) are depicted. The estimates are averages over ten trials, one 
for each of the ten one-against-the-rest problems in the USPS database. The three lines for 
the sparse greedy KFD are generated by requiring different accuracies for the dual error 
function, namely $10^{-a}$, $a = 1, \dots, 3$, relative to the function value (the curves are plotted
in this order from bottom to top). There is a speed-accuracy trade-off in that for large a, 
the algorithm converges more slowly. As a sanity check, the a = 3 system was evaluated 
on the USPS test set. Although the parameters used (kernel and regularization constant) 
were those found to be optimal for the QP algorithm, the performance of the sparse greedy 
algorithm is only slightly worse. In the log-log plot it can be seen that the QP algorithm 
roughly scales in a cubic manner with the number of samples, while for large sample sizes, 
the approximate algorithm scales with an exponent of about 3/2.


like BLAS [316, 145], and has the potential to be easily parallelized (the matrix 
update) and distributed. 

In a first evaluation, we used a one-against-the-rest task constructed from the 
USPS handwritten digit data set to test the runtime behavior of our new algo- 
rithm. The data are N = 256 dimensional and the set contains 7291 samples. All 
experiments were done with a Gaussian kernel, $\exp(-\|x - x'\|^2/(0.3\,N))$, and using
a regularization constant C = 1. We compare the performance with the program
given by (15.10) with the regularizer $\Omega(\alpha) = \|\alpha\|^2$. The results are given in Figure
15.2. It is important to keep in mind that the sparse greedy approach only needs to 
store at most an n x n matrix, where n is the maximal number of kernel functions 
chosen before termination. In contrast, previous approaches needed to store m x m 
matrices. 




[Figure 15.3: histogram panels FLD-L1, FLD-L2, FLD-L3 (top row) and KFD panels, including KFD-L3 (bottom row); classes: Natural, Man-made; vertical axis: Probability.]

Figure 15.3 Class histograms of the projections onto the Fisher discriminant direction 
(dashed), and estimated class-conditional densities (solid), for a task consisting of classi- 
fying image patches of natural vs. man-made objects. Top row: three systems using a Fisher 
Linear Discriminant (FLD) in input space. Bottom row: plots for the KFD approach. Note that 
due to the high dimensionality, the histograms are more Gaussian in the KFD case, making 
it easier to estimate class probabilities accurately (from [75]). 


15.4 Probabilistic Outputs 


We conclude this section by noting that while generalization error performance of 
the KFD is comparable to an SVM (cf. Table 15.1), a crucial advantage of the Fisher 
discriminant algorithm over standard SV classification¹ is that the outputs of the
former can easily be transformed into conditional probabilities of the classes; in 
other words, numbers that state not only whether a given test pattern belongs 
to a certain class, but also the probability of this event. This is due to the empirical 
observation (Figure 15.3) that in the high-dimensional feature space, the histogram 
of each class of training examples as projected onto the discriminant can be closely 
approximated by a Gaussian. 

To obtain class probabilities, we proceed as follows. We first estimate two one- 
dimensional Gaussian densities for the projections of the training points onto 
the direction of discrimination. We then use Bayes’ rule to derive the conditional 
probability that a test point x belongs to a given class + or —. 

Let $m_+$ and $m_-$ denote the number of positive and negative examples, respectively,
such that $m = m_+ + m_-$. From the projections of the training points (cf. (15.7)),

$$q_j = q(x_j) = \langle w, \Phi(x_j) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x_j), \tag{15.23}$$


1. For extensions of the SVM that produce probabilistic outputs, see [521, 486, 410]. 


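we can readily estimate the mean of each Gaussian,

$$\mu_\pm = \frac{1}{m_\pm} \sum_{i:\, y_i = \pm 1} q_i, \tag{15.24}$$

and the respective variances,

$$\sigma_\pm^2 = \frac{1}{m_\pm - 1} \sum_{i:\, y_i = \pm 1} (q_i - \mu_\pm)^2. \tag{15.25}$$

Note that the $-1$ in the denominator renders the variance estimator unbiased (cf.
Chapter 3, and for instance [49]). The class-conditional densities take the form

$$p(q \mid y = \pm 1) = (2\pi\sigma_\pm^2)^{-1/2} \exp\left(-\frac{(q - \mu_\pm)^2}{2\sigma_\pm^2}\right). \tag{15.26}$$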


In order to apply Bayes’ rule (Section 16.1.3), we need to determine prior proba- 
bilities of the two classes. These could either be known for the problem at hand 
(from some large set of previous observations, for example), or estimated from the 
current dataset. The latter approach only makes sense if the sample composition 
is representative of the problem at hand; it then amounts to setting 


$$P(y = \pm 1) = \frac{m_\pm}{m}. \tag{15.27}$$

To obtain the conditional probabilities of class membership y = +1 given the pattern 
x (sometimes called posterior probabilities), we use Bayes’ rule, 


$$P(y = \pm 1 \mid x) = P(y = \pm 1 \mid q) = \frac{p(q \mid y = \pm 1)\, P(y = \pm 1)}{p(q \mid y = +1)\, P(y = +1) + p(q \mid y = -1)\, P(y = -1)}, \tag{15.28}$$

where q = q(x) is defined as above.

Being able to estimate the conditional probabilities can be useful, for instance,
in applications where the output of a classifier needs to be merged with further
sources of information. Another recent application is to classification in the
presence of noisy class labels [315]. In this case, we formulate a probabilistic model
for the label noise. During learning, an EM procedure is applied to optimize the
parameters of the noise model and of the KFD, as well as the conditional
probabilities for the training patterns. The procedure alternates between the estimation of
the conditional probabilities, as detailed above, and a modified estimation of the
KFD which takes into account the conditional probabilities.

In this way, a point that has been recognized as being very noisy has a smaller
influence on the final solution. In [315], this approach was applied to the
segmentation of images into sky and non-sky areas. A standard classification approach
would require the hand-labelling of a large number of image patches, in order to
get a sufficiently large training set. The new “noisy label” algorithm, on the other
hand, is able to learn the task from images that are merely globally labelled
according to whether they contain any sky at all. Images without sky are then used
to produce training examples (image patches) of one class, while images with sky
are used to produce noisy training examples of the second class.
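The recipe of Section 15.4, equations (15.24) to (15.28), can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the projections `q_pos` and `q_neg` are made-up numbers standing in for the values q_i of the positive and negative training points, the priors are taken from the class frequencies as in (15.27), and all function names are ours.

```python
import math

def fit_gaussian(values):
    """Mean and unbiased variance of one class, cf. (15.24) and (15.25)."""
    m = len(values)
    mu = sum(values) / m
    var = sum((q - mu) ** 2 for q in values) / (m - 1)
    return mu, var

def gaussian_density(q, mu, var):
    """One-dimensional class-conditional density, cf. (15.26)."""
    return math.exp(-(q - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_positive(q, q_pos, q_neg):
    """P(y = +1 | q) via Bayes' rule, cf. (15.28), with priors from (15.27)."""
    m_pos, m_neg = len(q_pos), len(q_neg)
    prior_pos = m_pos / (m_pos + m_neg)
    prior_neg = m_neg / (m_pos + m_neg)
    mu_p, var_p = fit_gaussian(q_pos)
    mu_n, var_n = fit_gaussian(q_neg)
    joint_pos = gaussian_density(q, mu_p, var_p) * prior_pos
    joint_neg = gaussian_density(q, mu_n, var_n) * prior_neg
    return joint_pos / (joint_pos + joint_neg)

# Hypothetical projections of training points onto the discriminant direction.
q_pos = [1.8, 2.1, 2.4, 1.9, 2.3]
q_neg = [-2.2, -1.7, -2.0, -1.6, -2.5]

for q in (-2.0, 0.0, 2.0):
    print(q, posterior_positive(q, q_pos, q_neg))
```

For projections far from both class means, both densities can underflow to zero; a more careful implementation would work with log densities.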
15.5 Experiments


Let us first use a toy example to illustrate the KFD algorithm (Figure 15.4). The 
three panels show the same toy data set, with the KFD algorithm run three times, 
using Gaussian kernels 


$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{c}\right) \tag{15.29}$$


with different values of c. For large c, the induced feature space geometry resem- 
bles that of the input space, and the algorithm computes an almost linear discrim- 
inant function. For the problem at hand, which is not linearly separable, this is 
clearly not appropriate. For a very small c, on the other hand, the kernel becomes 
so local that the algorithm starts memorizing the data. For an intermediate kernel 
width, a good nonlinear separation can be computed. Note that in KFD, there is no 
geometrical notion of Support Vectors lying on the margin; indeed, the algorithm 
does not make use of a margin. 
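The effect of the width c in (15.29) can be made tangible by looking at the kernel matrix itself. The sketch below uses three made-up input points (an assumption, not data from the experiments): for a very small c the matrix is close to the identity, so every point is similar only to itself, while for a large c all entries approach 1 and vary only slightly with distance.

```python
import math

def gaussian_kernel(x, y, c):
    """k(x, x') = exp(-||x - x'||^2 / c), cf. (15.29)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / c)

# Hypothetical two-dimensional inputs.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

for c in (0.01, 1.0, 100.0):
    K = [[gaussian_kernel(p, q, c) for q in points] for p in points]
    off_diagonal = [K[i][j] for i in range(3) for j in range(3) if i != j]
    # Small c: off-diagonal entries near 0. Large c: off-diagonal entries near 1.
    print(c, min(off_diagonal), max(off_diagonal))
```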

Applications of KFD to real world data are currently rather rare. Extensive 
benchmark comparisons were performed in [364], however, for a selection of 
binary classification problems available from [36]. As shown in Table 15.1, KFD 
performs very well, even when compared to state-of-the-art classifiers such as 
AdaBoost [174] and SVMs (Chapter 7). Performance comparisons on the USPS 
handwritten digit recognition task can be found in Table 7.4. 


Figure 15.4 KFD toy example. In all three cases, a linear Fisher discriminant was com- 
puted based on the data points mapped into the feature space induced by a kernel. We 
used a Gaussian kernel (see text) in all cases, with different values of the kernel width. On 
the left, a rather small width was used, leading to data memorization. On the right, a wide 
kernel was used, with the effect that the decision boundary is almost linear; again, this is not 
appropriate for the given task. For an intermediate kernel size (middle), a good nonlinear 
separation is obtained. In all panels, the solid black line gives the actual decision boundary, 
while the dashed lines depict the areas corresponding to the two hyperplanes in feature 
space that, when projected on the direction of discrimination, fall on the means of the two 
classes. 
Table 15.1 Comparison [364] between Support Vector Machines (Chapter 7), the Kernel
Fisher Discriminant (KFD), a single radial basis function classifier (RBF), AdaBoost (AB, 
[174]), and regularized AdaBoost (ABr, [428]) on 13 different benchmark datasets (see text). 
The best result in boldface, the second best in italics. 


                 SVM          KFD          RBF          AB           ABr
Banana           11.5±0.07    10.8±0.05    10.8±0.06    12.3±0.07    10.9±0.04
Breast Cancer    26.0±0.47    25.8±0.46    27.6±0.47    30.4±0.47    26.5±0.45
Diabetes         23.5±0.17    23.2±0.16    24.3±0.19    26.5±0.23    23.8±0.18
German           23.6±0.21    23.7±0.22    24.7±0.24    27.5±0.25    24.3±0.21
Heart            16.0±0.33    16.1±0.34    17.6±0.33    20.3±0.34    16.5±0.35
Image             3.0±0.06     3.3±0.06     3.3±0.06     2.7±0.07     2.7±0.06
Ringnorm          1.7±0.01     1.5±0.01     1.7±0.02     1.9±0.03     1.6±0.01
F. Sonar         32.4±0.18    33.2±0.17    34.4±0.20    35.7±0.18    34.2±0.22
Splice           10.9±0.07    10.5±0.06    10.0±0.10    10.1±0.05     9.5±0.07
Thyroid           4.8±0.22     4.2±0.21     4.5±0.21     4.4±0.22     4.6±0.22
Titanic          22.4±0.10    23.2±0.20    23.3±0.13    22.6±0.12    22.6±0.12
Twonorm           3.0±0.02     2.6±0.02     2.9±0.03     3.0±0.03     2.7±0.02
Waveform          9.9±0.04     9.9±0.04    10.7±0.11    10.8±0.06     9.8±0.08


15.6 Summary 


Kernel Fisher Discriminant (KFD) analysis can be considered a merge of SVM 
classifiers and Kernel PCA. As with any classifier, it takes into account the labels y; 
of the data; as in PCA, it finds an “interesting” direction in a dataset by maximizing 
a criterion involving variances. Finally, KFD analysis resembles both SVMs and 
Kernel PCA by operating in a feature space induced by a kernel function. 

The result is an algorithm which performs just as well as SVM classifiers; on 
some problems, it is actually slightly better, but it would be premature to draw 
any far-reaching conclusions from this. Its largest disadvantage compared with 
SVMs is that training procedures for KFD are not yet as well developed as those 
for SVMs. Until recently, KFD was only applicable to fairly small problems, as it 
required m x m matrices to be stored. In Section 15.3, we described new techniques 
which go some way in closing the gap between SVM training and KFD training; 
for very large datasets, however, it is an open question whether KFD analysis is 
competitive with sophisticated SVM training methods (Chapter 10). 

On the other hand, KFD has the advantage that it lends itself to a probabilistic 
interpretation, since its outputs can readily be transformed into conditional proba- 
bilities of class membership. If we care only about the final classification, this may 
not be of interest; however, there are applications where we are interested not only 
in a class assignment, but also in a probability to go with it. Judging from present 
day training methodologies, KFD should excel in medium sized problems of this 
type (m < 25000, say), which are large enough that Bayesian techniques such as 
the Relevance Vector Machine (Section 16.6) are too expensive to train. 
15.7 Problems


15.1 (Dual Eigenvalue Problem for KFD •) Derive (15.6) from the Rayleigh coefficient.


15.2 (Fisher Direction of Discrimination ••) Prove that the solution α of the dual
Fisher problem satisfies $\alpha \propto N^{-1}(\mu_- - \mu_+)$.


15.3 (Quadratic Program for KFD [364, 362] ••) Derive (15.10).


15.4 (Fisher Loss Function ••) Discuss the differences between the Fisher loss function
and the SVM loss function, comparing the quadratic programs (7.35) and (15.10). Also 
compare to the loss employed in [532]. 


15.5 (Relationship between KFD and SVM ••) Study the SVM training algorithm
of Pérez-Cruz et al. [407], and show that within the working set, it computes the KFD 
solution. Argue that the difference between SVM and KFD thus lies in the working set 
selection. 


15.6 (Optimality of Fisher Discriminant ••) Prove that the KFD algorithm gives a
decision boundary with the lowest possible error rate if the two classes are normally 
distributed with equal covariance matrices in feature space and the sample estimates of 
the covariance matrices are perfect. 


15.7 (Scale Invariance ••) Prove that the KFD decision boundary does not change if
some direction of feature space is scaled by c ≠ 0. Hint: you do not need to worry about
kernels. Just prove that the statement is true in a vector space. Note that it is sufficient to 
consider finite-dimensional vector spaces, as the data lie in a finite-dimensional subset of 
feature space. 

Argue that this invariance property does not hold true for (kernel) PCA. 


15.8 (KFD performs Regularized Regression on the Class Labels •••) Prove that
a least-mean-squares regression (in feature space) on the class labels yields the same direc- 
tion of discrimination as KFD (for the case of standard Fisher Discriminant Analysis, cf. 
[49], for example; see also [609]). 

Discuss the role of regularization as described in Section 15.2. To what kind of regular- 
ization does regression on the labels correspond? 


15.9 (Conditional Class Probabilities vs. Probit ••) Discuss the connections between
probit (see (16.5)) and the method for estimating conditional probabilities described in Sec- 
tion 15.4. 


15.10 (Multi-Class KFD [171, 186, 442] ••) Generalize the KFD algorithm to deal with
M classes of patterns. In that case, there is no longer a one-dimensional direction of
discrimination. Instead, the algorithm provides a projection onto an (M − 1)-dimensional
subspace of H.
16 Bayesian Kernel Methods


The Bayesian approach to learning exhibits some fundamental differences with
respect to the framework of risk minimization, which was the leitmotif of this book.
The key distinction is that the former allows for a very intuitive incorporation of 
prior knowledge into the process of estimation. Moreover, it is possible, within 
the Bayesian framework, to obtain estimates of the confidence and reliability of 
the estimation process itself. These estimates can be computed easily, unlike the 
uniform convergence type bounds we encountered in Chapters 5 and 12. 

Surprisingly enough, the Bayesian approach leads to algorithms much akin to 
those developed within the framework of risk minimization. This allows us to pro- 
vide new insight into kernel algorithms, such as SV classification and regression. 
In addition, these similarities help us design Bayesian counterparts for risk min- 
imization algorithms (such as Laplacian Processes (Section 16.5)), or vice versa 
(Section 16.6). In other words, we can tap into the knowledge from both worlds 
and combine it to create better algorithms. 

We begin in Section 16.1 with an overview of the basic assumptions underlying 
Bayesian estimation. We explain the notion of prior distributions, which encode 
our prior belief concerning the likelihood of obtaining a certain estimate, and 
the concept of the posterior probability, which quantifies how plausible functions 
appear after we observe some data. Section 16.2 then shows how inference is 
performed, and how certain numerical problems that arise can be alleviated by 
various types of Maximum-a-Posteriori (MAP) estimation. 

Once the basic tools are introduced, we analyze the specific properties of 
Bayesian estimators for three different types of prior probabilities: Gaussian Pro- 
cesses (Section 16.3 describes the theory and Section 16.4 the implementation), 
which rely on the assumption that adjacent coefficients are correlated, Laplacian 
Processes (Section 16.5), which assume that estimates can be expanded into a 
sparse linear combination of kernel functions, and therefore favor such hypothe- 
ses, and Relevance Vector Machines (Section 16.6), which assume that the contri- 
bution of each kernel function is governed by a normal distribution with its own 
variance. 

Readers interested in a quick overview of the principles underlying Bayesian 
statistics will find the introduction sufficient. We recommend that the reader focus 
first on Sections 16.1 and 16.3. The subsequent sections are ordered in increasing 
technical difficulty, and decreasing bearing on the core issues of Bayesian estima- 
tion with kernels. 


Prerequisites

This chapter is intended for readers who are already familiar with the basic
concepts of classification and regression, as explained in the introduction (Chapter 1),
and with the ideas underlying regularization (Chapter 4); in particular, the
regularized risk functional in Section 4.1. Knowledge of Maximum Likelihood (as
in Section 3.3.1) is also required. The treatment of Gaussian Processes assumes
knowledge of regularization operators and Reproducing Kernel Hilbert Spaces
(Section 2.2.2). Details of the implementation of Gaussian Processes require
knowledge of optimization (Chapter 6), especially Newton’s method (Section 6.2) and
sparse greedy methods (Section 6.5).


16.1 Bayesics


The central characteristic of Bayesian estimation is that we assume certain prior 
knowledge or beliefs about the data generating process, and the functional de- 
pendencies we might encounter. Let us begin with the data generation process. 
The discussion following is closely connected to the reasoning we put forward 
in Section 3.3.1 to explain maximum likelihood estimation. Unless stated otherwise,
we observe an m-sample $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$, based on
which we will carry out inference. For notational convenience we sometimes use
$Z := \{(x_1, y_1), \ldots, (x_m, y_m)\}$ instead of X, Y. We begin with an overview of the
fundamental ideas (see also [49, 338, 383, 486, 432] for more details).


16.1.1 Likelihood 


Assume that we are given a hypothesis f and information about the process 
that maps x into y. We can formalize the latter via the distribution P(y|x, f(x)), 
and if a density exists, via p(y|x, f(x)). The key difference to the reasoning so 
far is that we assume the distribution P(y|x, f(x)) is known (in Section 16.1.4, we 
relax this assumption to the knowledge of a parametric family through the use of 
hyperparameters). 


■ For instance, f(x) could be the true wind speed at a location x, and y the observed
speed, where the observation is corrupted by errors in the measurement process
(imprecision of instruments, improper handling, etc.). In other words, we do not
observe f(x) but rather f(x) + ξ, where ξ with corresponding distribution P(ξ) is a random
variable modelling the noise process. In this case, we obtain

$$y = f(x) + \xi \quad \text{and} \quad P(y \mid x, f(x)) = P(y - f(x)). \tag{16.1}$$


■ Likewise, y and f(x) could be binary variables, such as black and white pixels
on an image. Thus, consider the case where f(x) is the color of the pixel at location
x, and y the color of this pixel on a copy of the image received by a (noisy) fax
transmission. In this case, we need only consider the probabilities

$$P(y \mid f(x)) \text{ where } y, f(x) \in \{\pm 1\}, \text{ i.e., } P(1, 1),\ P(1, -1),\ P(-1, 1),\ P(-1, -1). \tag{16.2}$$


■ We might want to model the probability that a patient develops cancer, P(y =
cancer|x), based on a set of medical observations x. We observe only the outcomes 
y = “cancer” or y = “no cancer”, however. One way of solving this problem is 
to use a functional dependency f(x) which can be transformed into P(y|x) via a 
transfer function 


We give three examples of such transfer functions below. 


Logistic Transfer Function: this is given by 


$$P(y = 1 \mid x, f(x)) := \frac{\exp(f(x))}{1 + \exp(f(x))}. \tag{16.3}$$

Note that logistic regression with (16.3) is equivalent to modelling

$$f(x) = \ln \frac{p(y = 1 \mid f(x))}{p(y = -1 \mid f(x))}, \tag{16.4}$$


since p(y = —1|f(x)) = 1 — p(y = 1|f(x)). Solving (16.4) for p(y = 1|f(x)) yields 
(16.3). 


Probit: we might also assume that y is given by the sign of f, but corrupted by 
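Gaussian noise (see for instance [395, 396, 486]); thus, y = sgn(f(x) + ξ) where
ξ ~ N(0, σ) with standard deviation σ. In this case, we have

$$p(y \mid f(x)) = \int \frac{\operatorname{sgn}(y f(x) + \xi) + 1}{2}\, p(\xi)\, d\xi \tag{16.5}$$

$$= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-y f(x)}^{\infty} \exp\left(-\frac{\xi^2}{2\sigma^2}\right) d\xi = \Phi\left(\frac{y f(x)}{\sigma}\right). \tag{16.6}$$

Here Φ is the distribution function of the standard normal distribution.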

Label Noise: finally, we might want to do classification in the presence of random 
label noise (possibly in addition to the noise model $p_0(y \mid f(x))$ discussed previously).
In this case, a label is randomly assigned to observations with probability 2η (note
that this is the same as randomly flipping with probability η). We then write

$$p(y \mid f(x)) = \eta + (1 - 2\eta)\, p_0(y \mid f(x)). \tag{16.7}$$
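The three transfer functions above can be written down directly. The following is a minimal sketch using only the Python standard library; the function names are ours, labels are assumed to take values in {-1, +1}, and in the probit case sigma is treated as the standard deviation of the noise.

```python
import math

def logistic(y, f):
    """P(y | f(x)) for y in {-1, +1}; for y = +1 this equals (16.3)."""
    return 1.0 / (1.0 + math.exp(-y * f))

def probit(y, f, sigma=1.0):
    """P(y | f(x)) = Phi(y f(x) / sigma), cf. (16.6), using the error function."""
    return 0.5 * (1.0 + math.erf(y * f / (sigma * math.sqrt(2.0))))

def with_label_noise(p0, eta):
    """Combine a base model value p0 = p_0(y | f(x)) with label noise, cf. (16.7)."""
    return eta + (1.0 - 2.0 * eta) * p0

f_value = 0.7  # hypothetical value of f(x)
print(logistic(+1, f_value), probit(+1, f_value), with_label_noise(logistic(+1, f_value), 0.1))
```

The logistic form uses the identity exp(f)/(1 + exp(f)) = 1/(1 + exp(-f)), and the log-odds of `logistic` recover f, which is the statement of (16.4).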
Recall that in the (important) special case of iid generated data, the likelihood
factorizes (3.18), and we obtain

$$P(Y \mid X, f) = \prod_{i=1}^{m} P(y_i \mid x_i, f(x_i)). \tag{16.8}$$

As long as $P(y_i \mid x_i, f(x_i))$ is indeed the underlying distribution, $P(Y \mid X, f)$ tells us
how likely it is that the sample X, Y was generated by f. In the following discus- 
sion, we use Bayes’ rule to turn this connection around, and consider how likely it 
is that f explains the data X, Y. Before we address this issue, however, we have to 
specify our assumptions about the hypotheses f that can be used. 
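In practice the product in (16.8) is usually evaluated as a sum of logarithms. The sketch below assumes the additive Gaussian noise model of (16.1) with a made-up noise level, and compares two hypothetical hypotheses on made-up data; none of these choices come from the text.

```python
import math

def gaussian_noise_density(y, fx, sigma):
    """p(y | x, f(x)) for y = f(x) + noise, with Gaussian noise of std sigma."""
    return math.exp(-(y - fx) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(f, xs, ys, sigma=0.5):
    """ln P(Y | X, f) as a sum over iid observations, cf. (16.8)."""
    return sum(math.log(gaussian_noise_density(y, f(x), sigma)) for x, y in zip(xs, ys))

# Hypothetical observations.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

def identity(x):
    return x

def zero(x):
    return 0.0

for name, f in (("identity", identity), ("zero", zero)):
    print(name, log_likelihood(f, xs, ys))
```

The hypothesis with the larger log-likelihood is the one under which the sample is more probable; this says nothing yet about how plausible the hypothesis itself is, which is where the prior enters.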


16.1.2 Prior Distributions 


When solving a practical estimation problem, we usually have some prior knowl- 
edge about the outcome we expect, be it a desire for smooth estimates, preference 
for a specific parametric form, or preferred correlations between certain dimen- 
sions of the inputs x; or between the predictions at different locations. In short, we 
may have a (possibly vague) idea of the distribution of hypotheses f, P(f), that we 
expect to observe. Before we proceed with the technical details, let us review some 
examples. 


■ We may know that f is a linear combination of sin x, cos x, sin 2x, and cos 2x, and
that the coefficients are chosen from the interval [−1, 1]. In this case, we can write
the density p(f) as

$$p(f) = \begin{cases} \frac{1}{16} & \text{if } f = \alpha_1 \sin x + \alpha_2 \cos x + \alpha_3 \sin 2x + \alpha_4 \cos 2x \text{ with } \alpha_i \in [-1, 1] \\ 0 & \text{otherwise} \end{cases}$$

This is a parametric prior on f.


■ We may not know much more about f than that its values $f(x_i)$ are correlated
and are distributed according to a Gaussian distribution with zero mean and
covariance matrix K. For three values (we use $f_i$ as a shorthand), this leads to

$$p(f_1, f_2, f_3) = \frac{1}{\sqrt{(2\pi)^3 \det K}} \exp\left(-\frac{1}{2} (f_1, f_2, f_3)\, K^{-1} (f_1, f_2, f_3)^\top\right). \tag{16.9}$$

The larger the off-diagonal elements $K_{ij}$, the more the corresponding function
values $f(x_i)$ and $f(x_j)$ are correlated. The main diagonal elements $K_{ii}$ provide the
variance of $f_i$, and the off-diagonal elements the covariance between pairs $f_i$ and
$f_j$. Note that in this case we do not specify a prior assumption about the function
f, but only about its values $f(x_i)$ at some previously specified locations.

The choice of K as a name for the covariance is deliberate. As we will see in 
Section 16.3, K is identical to the kernel matrix used in Reproducing Kernel Hilbert 
Spaces and regularization theory. The idea is that observations are generated by a 
stochastic process with a given covariance structure. 
Figure 16.1 Two functions (left) on [−1, 1] → R and their derivatives (right). Even though
the top and bottom functions closely resemble each other, the top function has a higher
prior probability of occurrence according to (16.10), since its value of ||f'|| is smaller.


■ Finally, we may only have the abstract knowledge that smooth functions with
small values are more likely to occur. Figure 16.1 is a pictorial example of such an
assumption. One possible way of quantifying such a relation is to posit that the
prior probability of a function occurring depends only on its $L_2$ norm and the $L_2$
norm of its first derivative. This leads to expressions of the form

$$-\ln p(f) = c + \|f\|^2 + \|\partial_x f\|^2. \tag{16.10}$$

In other words, non-smooth functions with large values of $\|\partial_x f\|^2$ and large
functions are less likely to occur.
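For the second example in the list above (Gaussian-distributed function values with covariance matrix K), one way to build intuition is to draw samples from the prior (16.9). The sketch below is an illustration under assumptions of our own: a Gaussian kernel of unit width defines K, a small jitter is added to the diagonal for numerical stability, and the Cholesky factorization is written out by hand to keep the example free of external libraries.

```python
import math
import random

def gaussian_kernel(x, y, width=1.0):
    """Assumed covariance function k(x, x') = exp(-(x - x')^2 / (2 width^2))."""
    return math.exp(-(x - y) ** 2 / (2.0 * width ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_prior(xs, rng, jitter=1e-9):
    """Draw (f(x_1), ..., f(x_n)) from N(0, K) with K_ij = k(x_i, x_j)."""
    n = len(xs)
    K = [[gaussian_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # If z ~ N(0, I), then L z ~ N(0, L L^T) = N(0, K).
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

rng = random.Random(0)
print(sample_prior([0.0, 0.1, 3.0], rng))
```

Values at the nearby locations 0.0 and 0.1 tend to be almost equal across draws, since the corresponding off-diagonal entry of K is close to 1, while the value at 3.0 varies nearly independently.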


Eq. (16.10) is an example of a nonparametric prior on f. As in the previous example,
we will see that (16.10) leads to Gaussian Processes (Section 16.3). Furthermore,
we will point out the connection to regularization operators.¹
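The smoothness prior (16.10) can be evaluated numerically for concrete functions. The sketch below approximates both squared norms on [-1, 1] by sums over a fine grid and compares a smooth function with a copy carrying a small high-frequency ripple, in the spirit of Figure 16.1. The two test functions, the grid size, and the choice c = 0 are assumptions made for illustration.

```python
import math

def neg_log_prior(f, n=2000, c=0.0):
    """Approximate -ln p(f) = c + ||f||^2 + ||d/dx f||^2 on [-1, 1], cf. (16.10)."""
    h = 2.0 / n
    vals = [f(-1.0 + i * h) for i in range(n + 1)]
    # Squared L2 norm of f, using the average of neighbouring grid values per cell.
    norm_f = sum(((vals[i] + vals[i + 1]) / 2.0) ** 2 for i in range(n)) * h
    # Squared L2 norm of f', using finite differences per cell.
    norm_df = sum(((vals[i + 1] - vals[i]) / h) ** 2 for i in range(n)) * h
    return c + norm_f + norm_df

def smooth(x):
    return math.sin(math.pi * x)

def wiggly(x):
    # Same overall shape, plus a small ripple that mainly affects the derivative.
    return math.sin(math.pi * x) + 0.05 * math.sin(40.0 * math.pi * x)

print(neg_log_prior(smooth), neg_log_prior(wiggly))
```

Although the two functions differ by at most 0.05 in value, the rippled one has a much larger derivative norm and therefore a much smaller prior probability.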

Now that we have stated our assumptions as to the probability of certain hy- 
potheses occurring, we will study the likelihood that a given hypothesis is respon- 
sible for a particular sample X, Y. 


1. As we shall see, the connection with regularization is that $-\ln p(f) = \Omega[f] + c$.
16.1.3 Bayes’ Rule and Inference


We begin with Bayes’ rule. In the previous two sections, we gave expressions for 
p(Y|f, X) and p(f). Let us further assume that p(f) and p(X) are independent.²
We may simply combine the conditional probability with the prior probability to 
obtain 


$$p(Y \mid f, X)\, p(f) = p(Y, f \mid X). \tag{16.11}$$

On the other hand, we might also condition f on Y, and decompose p(Y, f|X) into

$$p(f \mid X, Y)\, p(Y) = p(Y, f \mid X). \tag{16.12}$$


By combination of (16.11) and (16.12), we obtain Bayes’ rule, which allows us 
to solve for p(f|X, Y). The latter probability quantifies the evidence that f is the 
underlying function for X, Y. More formally, this reads as follows: 


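$$p(Y \mid f, X)\, p(f) = p(f \mid X, Y)\, p(Y), \quad \text{and thus} \quad p(f \mid X, Y) = \frac{p(Y \mid f, X)\, p(f)}{p(Y)}. \tag{16.13}$$

Since f does not enter into p(Y), we may drop the latter from (16.13), leading to

$$p(f \mid X, Y) \propto p(Y \mid f, X)\, p(f). \tag{16.14}$$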


Consequently, in order to assess which hypothesis f is more likely to occur, it is 
sufficient to analyze p(Y |f, X)p(f). Furthermore, we may recover p(Y) by comput- 
ing the normalization factor on the right hand side of (16.14). Finally, p(f|X, Y) 
also enables us to predict y at a new location x, using 


$$p(y \mid X, Y, x) = \int p(y \mid f, x)\, p(f \mid X, Y)\, df. \tag{16.15}$$


The quantity p(y|X, Y, x) tells us what observation we are likely to make at location 
x, given the previous observations X,Y. For instance, we could compute the 
expected value of the observation y(x) via 


$$\hat{y}(x) := \mathbf{E}[y(x) \mid Z], \tag{16.16}$$

and specify the confidence in the estimate. One way of computing the latter is via
tail-bounds on the probability of large deviations from the expectation,

$$P\left(|y(x) - \mathbf{E}[y(x) \mid Z]| > \epsilon \mid Z\right) < \delta. \tag{16.17}$$


Unfortunately, evaluation of (16.17) can be expensive, and cannot be carried out 
analytically in most cases. Instead, we use approximation methods, which will be 
described in Section 16.2. 
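When the hypothesis class is small enough to enumerate, (16.14) to (16.16) can be carried out by brute force. The sketch below is a toy setting of our own choosing: hypotheses f_a(x) = a x indexed by a slope a on a grid, additive Gaussian noise, and a Gaussian prior on a. The data, the noise level, and the prior width are made-up values.

```python
import math

def normal_pdf(z, sigma):
    return math.exp(-z * z / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical setup: hypotheses f_a(x) = a * x for slopes a on a grid.
slopes = [i / 100.0 for i in range(-300, 301)]
xs, ys = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
noise, prior_width = 0.3, 1.0

def unnormalized_posterior(a):
    """p(Y | f, X) p(f), the right-hand side of (16.14)."""
    likelihood = math.prod(normal_pdf(y - a * x, noise) for x, y in zip(xs, ys))
    return likelihood * normal_pdf(a, prior_width)

weights = [unnormalized_posterior(a) for a in slopes]
normalizer = sum(weights)  # plays the role of p(Y), up to the grid spacing
posterior = [w / normalizer for w in weights]

# Expected observation at a new location, cf. (16.15) and (16.16);
# the noise has zero mean, so E[y | f, x] = f(x).
x_new = 4.0
predictive_mean = sum(p * a * x_new for p, a in zip(posterior, slopes))
print(predictive_mean)
```

The sum over the grid stands in for the integral over f. For realistic function classes such an enumeration is not available, which is the motivation for the approximations of Section 16.2.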

2. In the absence of this assumption, we would have to find expressions for p(X, Y|f),
which would lead to an analogous result.


In some cases, it is more natural for inference purposes to analyze p(y, Y) (in
particular, if the dependency on x can be absorbed in a corresponding prior
probability over (y, Y)). Since p(y, Y) = p(y|Y)p(Y), however, we may easily obtain p(y|Y)
once we know the normalizing factor p(Y). The latter is obtained by integration
over y,

$$p(Y) = \int p(y, Y)\, dy \quad \text{and thus} \quad p(y \mid Y) = \frac{p(y, Y)}{p(Y)} = \frac{p(y, Y)}{\int p(y, Y)\, dy}. \tag{16.18}$$


This process is called marginalization.


16.1.4 Hyperparameters


Consider (16.10). We might question whether the trade-off between $\|f\|^2$ and
$\|\partial_x f\|^2$ is correct. Thus, rather than (16.10), we might instead write $-\ln p(f) =
c + \|f\|^2 + \omega \|\partial_x f\|^2$ for some ω > 0, effectively changing the scale of our flatness
assumption. Likewise, we may not always know the exact form of the likelihood 
function, but we might instead have a rough idea about the amount of additive 
noise involved in the process of obtaining y from f(x). Consequently, we need a 
device to encode our uncertainty about the hyperparameters used in the specifica- 
tion of the likelihood and prior probabilities. 

It is only natural to extend the Bayesian reasoning to these parameters by 
assuming a prior distribution on the hyperparameters themselves, and making 
the latter variables of the inference procedure. Generalizing from the previous 
example, we denote by ω the vector of all hyperparameters needed in a particular
situation. We obtain

$$p(f, \omega) = p(f \mid \omega)\, p(\omega) \quad \text{and thus} \quad p(f) = \int p(f, \omega)\, d\omega = \int p(f \mid \omega)\, p(\omega)\, d\omega. \tag{16.19}$$

We call p(ω) a hyperprior, since it is a prior assumption on the prior p(f) (or
p(Y|f, X)) itself. In theory, we could integrate out the hypotheses f to obtain the
posterior distribution over the hyperparameters ω,

$$p(\omega \mid Z) \propto p(Z \mid \omega)\, p(\omega) = p(\omega) \int p(Z \mid f)\, p(f \mid \omega)\, df, \tag{16.20}$$

and use the latter to obtain

$$p(f \mid Z) = \int p(f \mid \omega, Z)\, p(\omega \mid Z)\, d\omega. \tag{16.21}$$


Again, as in (16.15), an analytic solution of the integral is unlikely to be feasible, 
and we must resort to approximations (see Section 16.2.2). 


16.2 Inference Methods 


In this section we describe techniques useful for inference with Bayesian kernel 
methods, and relate these to algorithms used in the risk minimization framework. 
Readers interested in the connection with statistical learning theory are encour- 
aged to read Section 12.3 on PAC-Bayesian bounds. 
Figure 16.2 Left: The mode and mean of the distribution coincide, hence the MAP
approximation is satisfied. Right: For multimodal distributions, the MAP approximation can
be arbitrarily bad. 


Methods other than those described below may also be suitable for this task; 
for the sake of brevity, however, we focus on techniques currently in common 
use. Other important techniques not covered here include Markov Chain Monte 
Carlo (MCMC) [426, 383, 384], and the Expectation Maximization (EM) algorithm 
[135, 275, 610, 193, 260]. 


16.2.1 Maximum a Posteriori Approximation 


In most cases, integrals over p(f|X, Y), such as the expectation of f in (16.15), are 
computationally intractable. This means that we have to use approximate tech- 
niques to make predictions. A popular approximation is to replace the integral 
over p(f|X, Y) by the value of the integrand at the mode of the posterior distribu- 
tion, where p(f|X, Y) is maximal. The hope is that p(f|X, Y) is concentrated around 
its mode, and that mode and mean will approximately coincide. We thus approxi- 
mate (16.15) by 


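$$p(y \mid X, Y, x) \approx p(y \mid f_{\mathrm{MAP}}, x) \quad \text{where} \quad f_{\mathrm{MAP}} = \operatorname*{argmax}_{f}\, p(f \mid X, Y). \tag{16.22}$$

We call $f_{\mathrm{MAP}}$ the maximum a posteriori (MAP) estimate since it maximizes the
posterior distribution p(f|X, Y) over the hypotheses f. In practice we obtain $f_{\mathrm{MAP}}$
by minimizing the negative log posterior,

$$f_{\mathrm{MAP}} = \operatorname*{argmin}_{f}\, [-\ln p(f \mid Z)] = \operatorname*{argmin}_{f}\, [-\ln p(Z \mid f) - \ln p(f)]. \tag{16.23}$$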


The additional advantage of this method is that we completely avoid the issue of 
normalization, since (16.23) does not depend on p(Z). This approximation is justi- 
fied, for instance, if all distributions involved happen to be Gaussian, since mean 
and mode then coincide. See Figure 16.2 for an example where this assumption 
holds, and also for a counterexample. 
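To illustrate (16.23), consider again a toy setting that is our own assumption rather than part of the text: hypotheses f(x) = a x, additive Gaussian noise, and a Gaussian prior on the slope a. The negative log posterior is then a sum of a squared-error term and a quadratic penalty, and the MAP estimate can be found by a simple one-dimensional search. The ternary search used here is just one convenient choice for a convex objective.

```python
def neg_log_posterior(a, xs, ys, noise=0.3, prior_width=1.0):
    """-ln p(Z | f) - ln p(f) up to additive constants, for f(x) = a * x."""
    data_term = sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise ** 2)
    prior_term = a ** 2 / (2.0 * prior_width ** 2)
    return data_term + prior_term

def argmin_1d(g, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function g on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

# Hypothetical observations.
xs, ys = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
a_map = argmin_1d(lambda a: neg_log_posterior(a, xs, ys), -10.0, 10.0)
print(a_map)
```

Because every distribution in this example is Gaussian, the posterior over a is Gaussian as well, so mode and mean coincide and the MAP estimate loses nothing compared to the posterior mean. The normalization p(Z) never had to be computed.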
We may also require an approximation in the integral over the hyperparameter
ω due to the hyperprior p(ω). This situation occurs more frequently than the
need to compute a MAP estimate, since a complicated prior distribution p(f)
stemming from an integration over ω will probably render any subsequent steps
of integration intractable. Thus, we pick ω according to

$$\omega_{\mathrm{MAP}} = \operatorname*{argmax}_{\omega}\, p(\omega \mid Z). \tag{16.24}$$

In order to compute p(ω|Z), we apply Bayes’ rule. We then obtain

$$\omega_{\mathrm{MAP}} = \operatorname*{argmax}_{\omega}\, \frac{p(Z \mid \omega)\, p(\omega)}{p(Z)} = \operatorname*{argmin}_{\omega}\, [-\ln p(Z \mid \omega) - \ln p(\omega)]. \tag{16.25}$$


This procedure is sometimes referred to as the MAP2 estimate. A practical example 
of the use of hyperparameters is automatic relevance determination [383]. This 
addresses the proper scaling of observations, and the removal of inputs that prove 
to be irrelevant to the problem at hand. 


Remark 16.1 (Automatic Relevance Determination) Denote by n the dimensionality
of x, by $\omega := \mathrm{diag}(\omega_1, \ldots, \omega_n)$ a diagonal scaling matrix, and by

$$p(\omega) = \prod_{i=1}^{n} p(\omega_i) \tag{16.26}$$

a factorizing prior on the hyperparameters $\omega_i > 0$, possibly with p(ω) > p(ω') if ω > ω'
(this facilitates the elimination of irrelevant parameters). Assume moreover that we already
have a prior p(f) over hypotheses f. We can then form a prior distribution conditioned on
a hyperprior by letting

$$p(f \mid \omega) := p(f(\omega\, \cdot)). \tag{16.27}$$

In other words, functions f(ω ·) have the same prior distribution conditioned on the
hyperparameter ω as their un-scaled counterparts f(·), with respect to the prior p(f). This
scaling is particularly useful to weed out unwanted inputs and to find the right scaling
parameters for the remainder. See [338, 383] for more detail. 


Another advantage (beyond the computational aspect) of the MAP2 approxima- 
tion is that it obviates any problems with unnormalized or improper priors p(w) 
on w; in other words, functions p(w) with integrals that do not amount to 1, or 
which are not integrable at all (see [539] or Section 16.6 for an example of a 
(log-scale) flat prior over a hyperparameter, where p(ln w) = const.). 

This convenience comes at a price, however: Estimates obtained using improper 
priors no longer derive from true probability distributions, and much of the 
motivation for Bayesian techniques cannot then be justified. Nonetheless, these 
techniques work well in practice. We give an example of such a situation in 
Section 16.6. 


16.2.2 Parametric Approximation of the Posterior Distribution 


Instead of replacing p(f|Z) by its mode, we may want to resort to slightly more 
sophisticated approximations. A first improvement is to use a normal distribution 
$\mathcal{N}(\mu, \sigma^2)$, whose mean μ coincides with the mode of p(f|Z), and to obtain the 
variance $\sigma^2$ from the second derivative of −ln p(f|Z) at $f_{\mathrm{MAP}}$ (the variance is the 
inverse of this curvature). This is often referred to as the Gaussian Approximation. 
In practice, we set (see for instance [338]) 

$$ f|Z \sim \mathcal{N}\left( \mathbf{E}[f|Z], \Sigma^{-1} \right) \quad \text{where} \quad \Sigma = -\partial_f^2 \ln p(f|Z) \Big|_{f = \mathbf{E}[f|Z]}. \tag{16.28} $$


The advantage of such a procedure is that the integrals remain tractable. This is 
also one of the reasons why normal distributions enjoy a high degree of popularity 
in Bayesian methods. Besides, the normal distribution is the least informative 
distribution (largest entropy) among all distributions with bounded variance. 

As Figure 16.2 indicates, a single Gaussian may not always be sufficient to 
capture the important properties of p(y|X, Y, x). A more elaborate parametric model 
$q_\theta(f)$ of p(f|X, Y), such as a mixture of Gaussian densities, can then be used to 
improve the approximation of (16.15). A common strategy is to resort to variational 
methods. The details are rather technical and go beyond the scope of this section. 
The interested reader is referred to [274] for an overview, and to [53] for an ap- 
plication to the Relevance Vector Machine of Section 16.6. The following theorem 
describes the basic idea. 


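Theorem 16.2 (Variational Approximation of Densities) Denote by f, y random 
variables with corresponding densities p(f, y), p(f|y), and p(f). Then for any density 
q(f), the following bound holds: 

$$ \ln p(y) = \int \ln \frac{p(f, y)}{q(f)}\, q(f)\, df - \int \ln \frac{p(f|y)}{q(f)}\, q(f)\, df \ge \int \ln \frac{p(f, y)}{q(f)}\, q(f)\, df. \tag{16.29} $$


Proof We begin with the first equality of (16.29). Since p(f, y) = p(f|y)p(y), we 
may write 

$$ \ln \frac{p(f, y)}{q(f)} = \ln p(y) + \ln \frac{p(f|y)}{q(f)}. \tag{16.30} $$

Multiplying (16.30) by q(f) and integrating over f, using $\int q(f)\, df = 1$, yields the 
first equality. Additionally, $-\int \ln \frac{p(f|y)}{q(f)}\, q(f)\, df = \mathrm{KL}(q(f) \,\|\, p(f|y))$ is the 
Kullback-Leibler divergence between q(f) and p(f|y) [114]. The latter is a nonnegative 
quantity, which proves the second part of (16.29). ∎ 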


The true posterior distribution is usually p(f|y), and q(f) is an approximation of it. 
The practical advantage of (16.29) is that $L := \int \ln \frac{p(f, y)}{q(f)}\, q(f)\, df$ can often be computed 
more easily, at least for simple enough q(f). Furthermore, by maximizing L via a 
suitable choice of q, we maximize a lower bound on ln p(y). 
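
The bound of Theorem 16.2 can be checked numerically on a toy problem. The following is a minimal sketch, assuming a discrete variable f with three values (so the integrals become sums) and made-up joint probabilities; the names `joint`, `lower_bound`, and `kl` are illustrative only and do not come from the text.

```python
import math

# Hypothetical toy model: f takes three values and y is one fixed observation.
# joint[i] = p(f_i, y) for that observation (illustrative numbers only).
joint = [0.10, 0.25, 0.05]

p_y = sum(joint)                              # evidence p(y)
posterior = [pj / p_y for pj in joint]        # true posterior p(f|y)


def lower_bound(q):
    """L(q) = sum_f q(f) ln[p(f, y) / q(f)], the right-hand side of (16.29)."""
    return sum(qi * math.log(pj / qi) for qi, pj in zip(q, joint) if qi > 0)


def kl(q, p):
    """KL(q || p) = sum_f q(f) ln[q(f) / p(f)]."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q_rough = [1 / 3, 1 / 3, 1 / 3]               # an arbitrary approximation of the posterior
for name, q in [("uniform q", q_rough), ("q = posterior", posterior)]:
    print(f"{name}: L = {lower_bound(q):.4f}, "
          f"KL = {kl(q, posterior):.4f}, ln p(y) = {math.log(p_y):.4f}")
```

In this sketch L and the KL term always add up to ln p(y), and the bound becomes tight when q equals the true posterior, which is why maximizing L over q is a sensible way to fit the approximation.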


16.2.3 Connection to Regularized Risk Functionals 


A second glance at (16.23) reveals similarities between the negative log posterior 
−ln p(f|Z) and the regularized risk functional $R_{\mathrm{reg}}[f]$ of (4.1). Both expressions are 
sums of two terms: one depending on f and Z ($R_{\mathrm{emp}}[f]$ and −ln p(Z|f)), and the other 
independent of Z (λΩ[f] and −ln p(f)). In particular, if we formally set 

$$ m R_{\mathrm{emp}}[f] = \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) = -\ln p(Z|f), \tag{16.31} $$
$$ \lambda \Omega[f] = -\ln p(f), \tag{16.32} $$

we may obtain a Bayesian interpretation of regularized risk minimization as MAP 
estimation, and vice versa. We next discuss the interpretation of (16.31) and (16.32). 


• In the case of (16.31), recall that in Section 3.3.1 we assumed that we are 
dealing with iid (independent and identically distributed) data, and that we know the 
dependency between the hypothesis f and the observations $(x_i, y_i)$. As a 
consequence, we can write p(Z|f) as a product involving one pair of observations $(x_i, y_i)$ 
at a time. Finally, Remark 3.6 shows that if we set $c(x_i, y_i, f(x_i)) = -\ln p(y_i|x_i, f)$, we 
obtain direct correspondence in the data dependent part. 

As described in Chapter 3, this means that the loss function in the regularized 
risk functional is the equivalent of the negative log likelihood in the probabilistic 
setting. For instance, squared loss corresponds to the assumption that normal 
noise is added to the data. Similar conclusions can be drawn for classification, with 
certain known caveats due to non-normalizability of some of the loss functions 
commonly used in the risk functional context [521]. 


• The correspondence λΩ[f] = −ln p(f) in (16.32) shows that the choice of the 
regularizer influences the choice of the final estimate to the same extent as a 
prior over a function class. For instance, the choice of a particular feature space 
when using kernel methods acts in the same way as a prior over the class of 
possible functions in Bayesian estimation. This is an important fact to keep in 
mind when dealing with “distribution free” and “nonparametric” estimators. In 
effect, through a particular choice of regularization, and the consequent imposition 
of a partial order (roughly speaking, a ranking) on the set of possible solutions, 
we are selecting a particular prior. The only difference is that we do not use 
the probabilistic part of −ln p(f) when dealing with Ω[f] but merely compare 
different f, f′ by the corresponding size of Ω[f]. 


Bear in mind that the correspondence between regularized risk minimization 
and Bayesian methods only works for algorithms maximizing the log posterior 
to obtain a MAP solution. The reasoning does not go beyond this point, and in 
particular the risk functional approach has no equivalent of the averaging process 
involved in obtaining the mean, rather than the mode, of a distribution. 
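
The correspondence between squared loss and normal noise can be made concrete with a small sketch. It assumes scalar targets, a known noise level `sigma`, and two made-up candidate hypotheses given by their function values; all names and numbers are illustrative and not taken from the text.

```python
import math


def neg_log_likelihood(ys, fs, sigma):
    """-sum_i ln N(y_i; f_i, sigma^2), i.e. -ln p(Z|f) under additive normal noise."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (y - f) ** 2 / (2.0 * sigma ** 2) for y, f in zip(ys, fs))


def squared_loss_sum(ys, fs):
    """sum_i c(x_i, y_i, f(x_i)) with the squared loss c = (y - f)^2 / 2."""
    return sum(0.5 * (y - f) ** 2 for y, f in zip(ys, fs))


ys = [0.9, 2.1, 2.9]             # hypothetical observations
f_a = [1.0, 2.0, 3.0]            # values of a first candidate hypothesis
f_b = [0.0, 2.0, 4.0]            # values of a second candidate hypothesis
sigma = 0.5                      # assumed noise standard deviation

nll_gap = neg_log_likelihood(ys, f_b, sigma) - neg_log_likelihood(ys, f_a, sigma)
loss_gap = squared_loss_sum(ys, f_b) - squared_loss_sum(ys, f_a)
print(f"difference in -ln p(Z|f): {nll_gap:.4f}")
print(f"difference in squared loss / sigma^2: {loss_gap / sigma ** 2:.4f}")
```

The two printed differences agree because the negative log-likelihood equals the summed squared loss divided by sigma squared, plus a constant that does not depend on f. Both criteria therefore rank hypotheses identically.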

Risk functional approach                        | Bayesian estimation 
------------------------------------------------|------------------------------------------------ 
Empirical Risk     Σ_i c(x_i, y_i, f(x_i))      | neg. log-likelihood   −Σ_i ln p(y_i|f(x_i)) 
Regularization     λΩ[f]                        | neg. log-prior        −ln p(f) 
Regularized Risk   R_emp[f] + λΩ[f]             | neg. log-posterior    −ln p(Z|f) − ln p(f) 
Risk Minimizer                                  | MAP Estimate 


16.2.4 Translating Notations 


For the sake of clarity, we present a table which puts corresponding quantities 
from Bayesian estimation and the risk functional approach side by side. Needless 
to say, the table is a gross oversimplification of the deeper connections, but it still 
may be useful for “decoding” scientific literature using a different framework. 


16.3 Gaussian Processes 


Gaussian Processes are based on the “prior” assumption that adjacent observa- 
tions should convey information about each other. In particular, it is assumed that 
the observed variables are normal, and that the coupling between them takes place 
by means of the covariance matrix of a normal distribution. Eq. (16.9) is an 
example of such a coupling; the entries $K_{ij}$ of the matrix K tell us the correlation between 
the observations $f_i$ and $f_j$. 

It turns out that this is a convenient way of extending Bayesian modelling of 
linear estimators to nonlinear situations (cf. [601, 596, 486]). Furthermore, it repre- 
sents the counterpart of the “kernel trick” in methods minimizing the regularized 
risk. We now present the basic ideas, and relegate details on efficient implementa- 
tion of the optimization procedure required for inference to Section 16.4. 


16.3.1 Correlated Observations 


Assume we are observing function values $f(x_i)$ at locations $x_i$, as in (16.9). It is only 
natural to assume that these values are correlated, depending on their location 
$x_i$. Indeed, if this were not the case, we would not be able to perform inference, 
since by definition, independent random variables $f(x_i)$ do not depend on other 
observations $f(x_j)$. 

In fact, we make a stringent assumption regarding the distribution of the $f(x_i)$, 
namely that they form a normal distribution with mean μ and covariance 
matrix K. We could of course assume any arbitrary distribution; most other settings, 
however, result in inference problems that are rather expensive to compute. 
Furthermore, as Theorem 16.9 will show, there exists a large class of assumptions on 
the distribution of $f(x_i)$ that have a normal distribution as their limit. 

We begin with two observations, $f(x_1)$ and $f(x_2)$, for which we assume zero 
mean μ = (0, 0) and covariance $K = \begin{pmatrix} 1 & 3/4 \\ 3/4 & 3/4 \end{pmatrix}$. Figure 16.3 shows the 
corresponding density of the random variables $f(x_1)$ and $f(x_2)$. Now assume that we 
observe $f(x_1)$. This gives us further information about $f(x_2)$, which allows us to 
state the conditional density³ 

$$ p(f(x_2)|f(x_1)) = \frac{p(f(x_1), f(x_2))}{p(f(x_1))}. \tag{16.33} $$

Once the conditional density is known, the mean of $f(x_2)$ need no longer be 0, 
and the variance of $f(x_2)$ is decreased. In the example above, the latter becomes 3/16 
instead of 3/4; we have performed inference from the observation $f(x_1)$ to obtain 
possible values of $f(x_2)$. 

In a similar fashion, we may infer the distribution of $f(x_i)$ based on more 
than two variables, provided we know the corresponding mean μ and covariance 
matrix K. This means that K determines how closely the prediction relates to the 
previous observations $f(x_i)$. In the following section, we formalize the concepts 
presented here and show how such matrices K can be generated efficiently. 
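
The inference step just described can be sketched in a few lines. The sketch assumes the zero-mean two-variable example with covariance K = [[1, 3/4], [3/4, 3/4]] and uses the standard conditioning formulas for a bivariate normal; the function name `condition_2d` is illustrative and not part of the text.

```python
def condition_2d(K, f1):
    """Mean and variance of f(x2) given f(x1) = f1 for a zero-mean bivariate normal."""
    k11, k12, k22 = K[0][0], K[0][1], K[1][1]
    mean = k12 / k11 * f1            # the conditional mean is linear in the observation
    var = k22 - k12 ** 2 / k11       # the conditional variance does not depend on f1
    return mean, var


K = [[1.0, 0.75],
     [0.75, 0.75]]                   # assumed covariance of (f(x1), f(x2))

for f1 in (1.0, -2.0):               # the two observed values shown in Figure 16.3
    mean, var = condition_2d(K, f1)
    print(f"f(x1) = {f1:+.1f}: mean of f(x2) = {mean:+.4f}, variance = {var:.4f}")
```

Under these assumptions the conditional variance is 3/16 for either observed value, while the conditional mean moves with the observation.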


16.3.2 Definitions and Basic Notions 


Assume we are given a distribution over observations $t_i$ at the points $x_1, \ldots, x_m$. 
Rather than directly specifying that the observations $t_i$ are generated from an 
underlying functional dependency, we simply assume that they are generated by 
a Gaussian Process.⁴ Loosely speaking, Gaussian processes allow us to extend the 
notion of a set of random variables to random functions. More formally, we have 
the following definition: 


Definition 16.3 (Gaussian Process) Denote by t(x) a stochastic process parametrized 
by x ∈ X (X is an arbitrary index set). Then t(x) is a Gaussian process if for any m ∈ ℕ 
and $\{x_1, \ldots, x_m\} \subset X$, the random variables $(t(x_1), \ldots, t(x_m))$ are normally distributed. 


We denote by k(x, x′) the function generating the covariance matrix 

$$ K := \operatorname{cov}(t(x_1), \ldots, t(x_m)), \tag{16.34} $$

and by μ the mean of the distribution. We also write $K_{ij} = k(x_i, x_j)$. This leads to 

$$ (t(x_1), \ldots, t(x_m)) \sim \mathcal{N}(\mu, K) \quad \text{where } \mu \in \mathbb{R}^m. \tag{16.35} $$


3. A convenient trick to obtain $p(f(x_2)|f(x_1))$ for normal distributions is to consider 
$p(f(x_1), f(x_2))$ as a function only of $f(x_2)$, while keeping $f(x_1)$ fixed at its observed value. 
The linear and quadratic terms then completely determine the normal distribution in $f(x_2)$. 
4. We use $t_i$ for the random variables of the Gaussian process, since they are not the labels 
or target values $y_i$ that we observe at locations $x_i$. Instead, $t_i$ are corrupted by noise $\xi_i$ to 
yield the observed random variables $y_i$. 

Figure 16.3 Normal distribution with two variables. Top left: normal density 
$p(f(x_1), f(x_2))$ with zero mean and covariance K; Top right: contour plot of $p(f(x_1), f(x_2))$; 
Bottom left: Conditional density of $f(x_2)$ when $f(x_1) = 1$; Bottom right: Conditional density 
of $f(x_2)$ when $f(x_1) = -2$. Note that in the last two plots, $f(x_2)$ is normally distributed, but 
with nonzero mean. 


Remark 16.4 (Gaussian Processes and Positive Definite Matrices) The function 
k(x, x’) is well defined, symmetric, and the matrix K is positive definite (cf. Definition 2.4). 


Proof We first show that k(x, x′) is well defined. By definition, 

$$ \left[ \operatorname{cov}(t(x_1), \ldots, t(x_m)) \right]_{ij} = \operatorname{cov}(t(x_i), t(x_j)). \tag{16.36} $$

Consequently, $K_{ij}$ is only a function of two arguments ($x_i$ and $x_j$), which shows that 
k(x, x′) is well defined. 

It follows directly from the definition of the covariance that k is symmetric. 


Finally, to show that K is positive definite, we have to prove for any $\alpha \in \mathbb{R}^m$ that 
the inequality $\alpha^\top K \alpha \ge 0$ holds. This follows from 

$$ 0 \le \operatorname{Var}\left( \sum_{i=1}^{m} \alpha_i t(x_i) \right) = \alpha^\top \left[ \operatorname{cov}(t(x_i), t(x_j)) \right]_{ij} \alpha = \alpha^\top K \alpha. \tag{16.37} $$

Thus K is positive definite and the function k is an admissible kernel. ∎ 

Note that even if k happens to be a smooth function (this turns out to be a 
reasonable assumption), the actual realizations t(x), as drawn from the Gaussian process, 
need not be smooth at all. In fact, they may be even pointwise discontinuous. 

Let us have a closer look at the prior distribution resulting from these 
assumptions. The standard setting is μ = 0, which implies that we have no prior 
knowledge about the particular value of the estimate, but assume that small values are 
preferred. Then, for a given set of $(t(x_1), \ldots, t(x_m)) =: t$, the prior density function 
p(t) is given by 

$$ p(t) = (2\pi)^{-\frac{m}{2}} (\det K)^{-\frac{1}{2}} \exp\left( -\tfrac{1}{2} t^\top K^{-1} t \right). \tag{16.38} $$


In most cases, we try to avoid inverting K. By a simple substitution, 

$$ t = K\alpha, \tag{16.39} $$

we have $\alpha \sim \mathcal{N}(0, K^{-1})$, and consequently 

$$ p(\alpha) = (2\pi)^{-\frac{m}{2}} (\det K)^{\frac{1}{2}} \exp\left( -\tfrac{1}{2} \alpha^\top K \alpha \right). \tag{16.40} $$


Taking logs, we see that, up to a constant, $-\ln p(\alpha) = \tfrac{1}{2} \alpha^\top K \alpha$ is identical to Ω[f] from the 
regularization framework (4.80). This result thus connects Gaussian process priors and estimators 
using the Reproducing Kernel Hilbert Space framework: Kernels favoring smooth 
functions, as described in Chapters 2, 4, 11, and 13, translate immediately into 
covariance kernels with similar properties in a Bayesian context. 
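
To make the substitution explicit, the following short derivation sketches why the quadratic form in the exponent of the prior over t turns into one that needs no matrix inverse once t = Kα is inserted. It uses only the symmetry and invertibility of K:

```latex
\begin{align*}
t^\top K^{-1} t
  &= (K\alpha)^\top K^{-1} (K\alpha) && \text{by } t = K\alpha \\
  &= \alpha^\top K^\top K^{-1} K \alpha \\
  &= \alpha^\top K \alpha            && \text{since } K^\top = K .
\end{align*}
```

The exponent can therefore be evaluated from kernel values alone, which is what makes the parametrization in α attractive in practice.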


16.3.3 Simple Hypotheses 


Let us analyze in more detail which functions are considered simple by a Gaussian 
process prior. As we know, hypotheses of low complexity correspond to vectors 
y for which $y^\top K^{-1} y$ is small. This is in particular the case for the (normalized) 
eigenvectors $v_i$ of K with large eigenvalues $\lambda_i$, since 

$$ K v_i = \lambda_i v_i \quad \text{yields} \quad v_i^\top K^{-1} v_i = \lambda_i^{-1}. \tag{16.41} $$

In other words, the estimator is biased towards solutions with small $\lambda_i^{-1}$. This 
means that the spectrum and eigensystem of K represent a practical means of 
actually viewing the effect a certain prior has on the degree of smoothness of the 
estimates. 

Let us consider a practical example: For a Gaussian covariance kernel (see also 
(2.68)), 

$$ k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 w^2} \right), \tag{16.42} $$

where w = 1, and under the assumption of a uniform distribution on [−5, 5], we 
obtain the functions depicted in Figure 16.4 as simple base hypotheses for our 
estimator. Note the similarity to a Fourier decomposition: This means that the 
kernel has a strong preference for slowly oscillating functions. 

Figure 16.4 Hypotheses corresponding to the first eigenvectors of a Gaussian kernel of 
width 1 over a uniform distribution on the interval [−5, 5]. From top to bottom and from 
left to right: The functions corresponding to the first eight eigenvectors of K. Lower right: 
the first 20 eigenvalues of K. Note that most of the information about K is contained in the 
first 10 eigenvalues. The plots were obtained by computing K for an equidistant grid of 
200 points on [−5, 5]. We then computed the eigenvectors e of K, and plotted them as the 
corresponding function values (this is possible since for α = e we have Kα = λα). 


16.3.4 Regression 


Let us put the previous discussion to practical use. For the sake of simplicity, we 
begin with regression (we analyze classification in Section 16.3.5). For regression 
estimation, we usually assume additive noise on top of the process generating 
$t(x_i)$; that is, rather than observing $t(x_i)$ directly, we observe $t(x_i)$ corrupted by 
noise, 

$$ y_i := t(x_i) + \xi_i \quad \text{and thus} \quad \xi_i = y_i - t(x_i), \tag{16.43} $$

where $\xi_i$ are independent random variables with zero mean. In order to keep 
our notation simple, however, we assume that all $\xi_i$ are drawn from the same 
distribution⁵, ξ ~ p(ξ). This allows us to state the likelihood p(y|t(x)) as 

$$ p(y_i|t(x_i)) = p(y_i - t(x_i)). \tag{16.44} $$


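In other words, we eliminate the random variable $\xi_i$ via (16.43). The posterior 
distribution is then given by 

$$ p(t|y) \propto p(y|t)\, p(t) = \left[ \prod_{i=1}^{m} p(y_i - t_i) \right] \frac{1}{\sqrt{(2\pi)^m \det K}} \exp\left( -\tfrac{1}{2} t^\top K^{-1} t \right). \tag{16.45} $$
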
To perform inference, we have to specify the distribution connecting t and y. We 
could use any possible distribution, such as those in Table 3.1. 

A popular choice in Gaussian Process regression is to assume additive normal 
noise, $\xi_i \sim \mathcal{N}(0, \sigma^2)$. This has several advantages. First, all the distributions involved 
in the process of inference remain normal, which allows us to compute exact 
solutions. Second, as we will show below, we may find the mode of the distribution 
by a simple matrix inversion. After substituting t = Kα, taking logarithms, and 
ignoring terms independent of α, we obtain 

$$ \ln p(\alpha|y) = -\frac{1}{2\sigma^2} \|y - K\alpha\|^2 - \frac{1}{2} \alpha^\top K \alpha + c. \tag{16.46} $$

This is clearly the logarithm of a normal density, thus the mode and mean coincide, and the 
MAP approximation (16.22) becomes exact. The mode is obtained by maximizing 
(16.46) with respect to α, which yields 

$$ \alpha = (K + \sigma^2 \mathbf{1})^{-1} y. \tag{16.47} $$


Knowing α allows us to predict y at a new location x. The Bayesian reasoning, 
however, also allows us to associate a level of confidence with the estimate. For 
normal distributions it suffices to know the variance. 

One way to obtain this information is to write (16.45) for an m + 1 dimensional 
system where $y_{m+1}$ is unknown and compute the variance of $y_{m+1}$. There exists a 
more elegant way of obtaining the variance $\operatorname{Var} y_{m+1}$ for additive Gaussian noise, 
however. Since y is a sum of two Gaussian random variables, its covariance is 
given by $(K + \sigma^2 \mathbf{1})$, and thus 

$$ p(y, y_{m+1}) \propto \exp\left( -\frac{1}{2} \begin{bmatrix} y \\ y_{m+1} \end{bmatrix}^\top \begin{bmatrix} K + \sigma^2 \mathbf{1} & k \\ k^\top & k(x_{m+1}, x_{m+1}) + \sigma^2 \end{bmatrix}^{-1} \begin{bmatrix} y \\ y_{m+1} \end{bmatrix} \right). \tag{16.48} $$

Here K is an m × m matrix and $k = (k(x_1, x_{m+1}), \ldots, k(x_m, x_{m+1}))$. Since we already 
know y, we can obtain the variance of $y_{m+1}$ in $p(y_{m+1}|y)$ by computing the lower right 


5. This assumption is made for computational convenience only. We would otherwise have 
to consider different $p_i(\xi_i)$ for 1 ≤ i ≤ m. The likelihood still factorizes in this case, but the 
observations can no longer be treated equally. 

entry of the inverse matrix in (16.48) and taking its reciprocal. We can check (see for 
instance [337]) that this is given by 

$$ \operatorname{Var} y_{m+1} = k(x_{m+1}, x_{m+1}) + \sigma^2 - k^\top (K + \sigma^2 \mathbf{1})^{-1} k. \tag{16.49} $$

From (16.49) and (16.47), we conclude that $y(x_{m+1})$ is normally distributed with 

$$ y(x_{m+1}) \sim \mathcal{N}\left( k^\top (K + \sigma^2 \mathbf{1})^{-1} y,\; k(x_{m+1}, x_{m+1}) + \sigma^2 - k^\top (K + \sigma^2 \mathbf{1})^{-1} k \right). \tag{16.50} $$


In most other cases, such as for round-off noise, Laplacian noise, etc., an exact 
solution for the posterior probability is not possible, and we have to make do with 
approximations. While we do not discuss this subject further in the current section 
(see [486] for more detail), we return to the issue of approximating a posterior 
distribution in Sections 16.5 and 16.6, where the form of the prior distribution 
makes an exact computation of p(f|X, Y) difficult. 
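
The regression recipe for additive normal noise can be sketched end to end in plain Python: form K plus the noise term, solve for the expansion coefficients, and evaluate the predictive mean and variance. The sketch assumes scalar inputs, a Gaussian kernel of width 1, and made-up data. The helper names (`gauss_kernel`, `solve`, `gp_predict`) are illustrative, and the dense solver is only meant for very small m.

```python
import math


def gauss_kernel(x, x_prime, width=1.0):
    """Gaussian covariance kernel for scalar inputs."""
    return math.exp(-(x - x_prime) ** 2 / (2.0 * width ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def gp_predict(xs, ys, x_new, noise_var, width=1.0):
    """Predictive mean and variance of y at x_new under additive normal noise."""
    m = len(xs)
    Q = [[gauss_kernel(xs[i], xs[j], width) + (noise_var if i == j else 0.0)
          for j in range(m)] for i in range(m)]           # K + sigma^2 1
    k = [gauss_kernel(xi, x_new, width) for xi in xs]
    alpha = solve(Q, ys)                                   # coefficients as in (16.47)
    mean = sum(ki * ai for ki, ai in zip(k, alpha))        # k^T (K + sigma^2 1)^{-1} y
    q_inv_k = solve(Q, k)
    var = (gauss_kernel(x_new, x_new, width) + noise_var
           - sum(ki * vi for ki, vi in zip(k, q_inv_k)))   # variance as in (16.49)
    return mean, var


xs = [-1.0, 0.0, 1.5]            # hypothetical training inputs
ys = [0.2, 1.0, -0.5]            # hypothetical noisy targets
for x_new in (0.0, 5.0):
    mean, var = gp_predict(xs, ys, x_new, noise_var=0.1)
    print(f"x = {x_new}: mean = {mean:.3f}, variance = {var:.3f}")
```

Far away from the training inputs the kernel values in k vanish, so the predictive mean returns to zero and the variance approaches the prior value k(x, x) plus the noise variance.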


16.3.5 Classification 


For the sake of simplicity we limit ourselves to the case of two classes; that 
is, to binary classification (see for instance [486, 600] for details on multi-class 
discrimination). Rather than attempting to predict the labels $y_i \in \{\pm 1\}$ directly, 
we use logistic regression. Hence we try to model the conditional probabilities 
P(y = 1|x) and P(y = —1|x) alike. A popular choice is to posit a functional form for 
the link between f(x) and y, such as (16.3), (16.5), or (16.7). 

Matters are slightly easier for classification than for regression: provided we are 
able to find a hypothesis f (or a distribution over such hypotheses), we immediately 
know the confidence of the estimate. Thus, P(y = 1|x) not only tells us whether 
the estimator classifies x as +1 or —1, but also the probability of obtaining these 
labels. Therefore, calculations regarding the variation of f are not as important as 
they were in Section 16.3.4 (16.49). 

Let us proceed with a formal statement of a Gaussian process classification 
model. The posterior density is given by 


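$$ \begin{aligned} p(f|Z) &\propto p(Y|X, t)\, p(t) && (16.51) \\ &\propto \left[ \prod_{i=1}^{m} p(y_i|t(x_i)) \right] \exp\left( -\tfrac{1}{2} t^\top K^{-1} t \right), && (16.52) \end{aligned} $$

where $t = (t(x_1), \ldots, t(x_m))$. With the transformation t = Kα, and thus 

$$ t(x) = \sum_{j=1}^{m} \alpha_j k(x_j, x), \tag{16.53} $$

the negative logarithm of the posterior density becomes 

$$ -\ln p(f|X, Y) = -\sum_{i=1}^{m} \ln p(y_i|t(x_i)) + \frac{1}{2} \alpha^\top K \alpha + c. \tag{16.54} $$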


If we adopt the MAP (maximum a posteriori) methodology of Section 16.2.1, 
inference is carried out by searching for the mode of the density p(f|X, Y). Therefore, 
the estimation problem can be reduced to a nonlinear function minimization 
problem, as in the regression case (Section 16.3.4). We give examples of such techniques 
in Section 16.4. 


16.3.6 Adjusting Hyperparameters for Gaussian Processes 


More often than not, we will not know beforehand the exact amount of additive 
noise, or the specific form of the covariance kernel. To address this problem, the 
hyperparameter formalism of Section 16.1.4 is needed. To avoid technicalities, we 
only discuss the application of the MAP2 estimate (16.24) for the special case of 
regression with additive Gaussian noise, and refer the reader to [600, 193, 151, 426] 
and the references therein for integration methods based on Markov Chain Monte 
Carlo approximations (see also [486] for a more recent overview). 

We denote by w the set of hyperparameters we would like to adjust. In more 
compact notation, (16.48) becomes (now conditioned on w) 

$$ p(y|w) = \frac{1}{\sqrt{(2\pi)^m \det(K + \sigma^2 \mathbf{1})}} \exp\left( -\tfrac{1}{2} y^\top (K + \sigma^2 \mathbf{1})^{-1} y \right), \tag{16.55} $$

where K, σ are functions of w. In other words, (16.55) tells us how likely it is that 
we observe y, if we know w. 

Recall that the basic idea of the MAP2 estimate (16.24) is to maximize p(w|y) 
by maximizing p(y|w)p(w). In practice, this is achieved by gradient ascent (see 
Section 6.2.2) or second order methods (see Section 6.2.1 for Newton’s method) 
on p(y|w)p(w). Both cases require information about the gradient of (16.55) with 
respect to w. We give an explicit expression for the gradient below. 

Since the logarithm is monotonic, we can equivalently minimize the negative 
log posterior, −ln p(y|w)p(w). With the shorthand $Q := K + \sigma^2 \mathbf{1}$, we obtain 

$$ \begin{aligned} \partial_w \left[ -\ln p(y|w)\, p(w) \right] &= \tfrac{1}{2} \partial_w (\ln \det Q) + \tfrac{1}{2} \partial_w \left[ y^\top Q^{-1} y \right] - \partial_w \ln p(w) && (16.56) \\ &= \tfrac{1}{2} \operatorname{tr}\left( Q^{-1} \partial_w Q \right) - \tfrac{1}{2} y^\top Q^{-1} (\partial_w Q)\, Q^{-1} y - \partial_w \ln p(w). && (16.57) \end{aligned} $$


Here (16.57) follows from (16.56) via standard matrix algebra [337]. Likewise, we 
could compute the Hessian of −ln p(y|w)p(w) with respect to w and use a second 
order optimization method.⁶ 

If we assume a flat hyperprior (p(w) = const.), optimization over w simply 
becomes gradient descent in −ln p(y|w); in other words, the term depending 
on p(w) vanishes. Computing (16.57) is still very expensive numerically since it 
involves the inversion of Q, which is an m × m matrix. 
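
A gradient expression of this kind is easy to get wrong by a sign, so it is worth checking against finite differences. The sketch below assumes a flat hyperprior and a single hyperparameter, the noise variance (so that the derivative of Q with respect to it is the identity), together with a Gaussian kernel of width 1 and made-up data. The gradient it evaluates is half the trace of the inverse of Q minus half the squared norm of the inverse of Q applied to y. All helper names are illustrative.

```python
import math


def gauss_kernel(x, x_prime, width=1.0):
    return math.exp(-(x - x_prime) ** 2 / (2.0 * width ** 2))


def cholesky(A):
    """Lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def build_q(xs, noise_var):
    """Q = K + sigma^2 1 for scalar inputs xs."""
    m = len(xs)
    return [[gauss_kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0)
             for j in range(m)] for i in range(m)]


def neg_log_evidence(xs, ys, noise_var):
    """-ln p(y|w) for the Gaussian model, with w = noise_var."""
    L = cholesky(build_q(xs, noise_var))
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(xs)))
    q_inv_y = chol_solve(L, ys)
    quad = sum(yi * ai for yi, ai in zip(ys, q_inv_y))
    return 0.5 * log_det + 0.5 * quad + 0.5 * len(xs) * math.log(2.0 * math.pi)


def grad_neg_log_evidence(xs, ys, noise_var):
    """Analytic gradient with respect to the noise variance (dQ/dw = identity)."""
    m = len(xs)
    L = cholesky(build_q(xs, noise_var))
    q_inv_y = chol_solve(L, ys)
    trace_q_inv = sum(chol_solve(L, [1.0 if r == i else 0.0 for r in range(m)])[i]
                      for i in range(m))
    return 0.5 * trace_q_inv - 0.5 * sum(a * a for a in q_inv_y)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]     # hypothetical inputs
ys = [-0.8, -0.9, 0.1, 1.1, 0.7]     # hypothetical targets
w, h = 0.3, 1e-5
analytic = grad_neg_log_evidence(xs, ys, w)
numeric = (neg_log_evidence(xs, ys, w + h) - neg_log_evidence(xs, ys, w - h)) / (2 * h)
print(f"analytic = {analytic:.6f}, finite difference = {numeric:.6f}")
```

The Cholesky factor is reused for the log-determinant, for the solve against y, and for the trace, which keeps the cost at one factorization per gradient evaluation. That factorization is still cubic in m, which is the bottleneck the text refers to.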

There exist numerous techniques, such as sparse greedy approximation 
methods, to alleviate this problem. We present a selection of these techniques in the 
following section. Further detail on the topic of hyperparameter optimization can 
be found in Section 16.6, where hyperparameters play a crucial role in determining 
the sparsity of an estimate. 


6. This is rather technical, and the reader is encouraged to consult the literature for further 
detail [339, 426, 383, 197]. 


16.4 Implementation of Gaussian Processes 


In this section, we discuss various methods to perform inference in the case 
of Gaussian process classification or regression. We begin with a general pur- 
pose technique, the Laplace approximation, which is essentially an application of 
Newton's method (Section 6.2.1) to the problem of minimizing the negative log- 
posterior density. Since it is a second order method, it is applicable as long as the 
log-densities have second order derivatives. Readers interested only in the basic 
ideas of Gaussian process estimation may skip the present section. 

For classification with the logistic transfer function we present a variational 
method (Section 16.4.2), due to Jaakkola and Jordan [260], and Gibbs and MacKay 
[197, 198], that reduces the optimization to a linear system of equations. 

Finally, the special case of regression in the presence of normal noise admits 
very efficient optimization algorithms based on the approximate minimization of 
quadratic forms (Section 16.4.3). We subsequently discuss the scaling behavior and 
approximation bounds for these algorithms. 


16.4.1 Laplace Approximation 


In general the negative log posterior (16.54), which is minimized to obtain the 
MAP estimate, is not quadratic, hence the minimum cannot be found analytically 
(compare with (16.47), where the minimizer can be stated explicitly). A possible 
solution is to make successive quadratic approximations of the negative log 
posterior, and minimize the latter iteratively. This strategy is referred to as the Laplace 
approximation⁷ [525, 600, 486]; the Newton-Raphson method, in numerical 
analysis (see [530, 423]); or the Fisher scoring method, in statistics. 

A necessary condition for the minimum of a differentiable function g is that its 
first derivative be 0. For convex functions, this requirement is also sufficient. We 
approximate g′ linearly and require that the approximation vanish at x + Δx, 

$$ g'(x + \Delta x) \approx g'(x) + \Delta x\, g''(x), \quad \text{and hence} \quad \Delta x = -\frac{g'(x)}{g''(x)}. \tag{16.58} $$


7. Strictly speaking, the Laplace approximation refers only to the fact that we approximate 
the mode of the posterior by a Gaussian distribution. We already use the Gaussian 
approximation in the second order method, however, in order to maximize the posterior. Hence, for 
all practical purposes, the two approximations just represent two different points of view 
on the same subject. 

Substituting −ln p(f|X, Y) into (16.58) and using the definitions 

$$ c := \left( -\partial_{t(x_1)} \ln p(y_1|t(x_1)), \ldots, -\partial_{t(x_m)} \ln p(y_m|t(x_m)) \right), \tag{16.59} $$
$$ C := \operatorname{diag}\left( -\partial^2_{t(x_1)} \ln p(y_1|t(x_1)), \ldots, -\partial^2_{t(x_m)} \ln p(y_m|t(x_m)) \right), \tag{16.60} $$

we obtain the following update rule for α (see also Problem 16.12), 

$$ \alpha_{\mathrm{new}} = (KC + \mathbf{1})^{-1} (KC\alpha_{\mathrm{old}} - c). \tag{16.61} $$


While (16.61) is usually an efficient way of finding a maximizer of the log posterior, 
it is far from clear that this update rule is always convergent (to prove the latter, we 
would need to show that the initial guess of α lies within the radius of attraction; 
see Problem 16.11). Nonetheless, this approximation turns out to work in practice, 
and the implementation of the update rule is relatively simple. 
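
A Newton iteration of this type can be sketched for the logistic likelihood p(y|t) = 1/(1 + exp(−yt)) with labels ±1. The sketch assumes scalar inputs, a Gaussian kernel of width 1, and a made-up four-point data set; all helper names are illustrative. One caveat on the update itself: the code applies Newton's method directly to the negative log posterior as a function of α, which gives the step α_new = (CK + 1)⁻¹(CKα_old − c). This is my own derivation, and it coincides with the update rule printed above whenever C is a multiple of the identity (as in regression with normal noise).

```python
import math


def gauss_kernel(x, x_prime, width=1.0):
    return math.exp(-(x - x_prime) ** 2 / (2.0 * width ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def latent(K, alpha):
    """t = K alpha."""
    return [sum(kij * aj for kij, aj in zip(row, alpha)) for row in K]


def neg_log_posterior(K, y, alpha):
    """Negative log posterior for the logistic likelihood, constants dropped."""
    t = latent(K, alpha)
    data_term = sum(math.log(1.0 + math.exp(-yi * ti)) for yi, ti in zip(y, t))
    return data_term + 0.5 * sum(ai * ti for ai, ti in zip(alpha, t))


def first_derivs(y, t):
    """c_i = -d/dt ln p(y_i|t_i) for the logistic likelihood."""
    return [-yi * sigmoid(-yi * ti) for yi, ti in zip(y, t)]


def newton_step(K, y, alpha):
    m = len(y)
    t = latent(K, alpha)
    c = first_derivs(y, t)
    C = [sigmoid(ti) * (1.0 - sigmoid(ti)) for ti in t]    # diagonal of C
    A = [[C[i] * K[i][j] + (1.0 if i == j else 0.0) for j in range(m)]
         for i in range(m)]                                 # C K + 1
    rhs = [C[i] * t[i] - c[i] for i in range(m)]            # C K alpha - c
    return solve(A, rhs)


def stationarity_residual(K, y, alpha):
    """max_i |alpha_i + c_i|; zero at the mode when K is invertible."""
    c = first_derivs(y, latent(K, alpha))
    return max(abs(ai + ci) for ai, ci in zip(alpha, c))


xs = [-2.0, -1.0, 1.0, 2.0]          # hypothetical inputs
y = [-1, -1, 1, 1]                   # hypothetical labels
K = [[gauss_kernel(a, b) for b in xs] for a in xs]

alpha = [0.0] * len(xs)
for _ in range(10):
    alpha = newton_step(K, y, alpha)
print("alpha =", [round(a, 4) for a in alpha])
print("residual =", stationarity_residual(K, y, alpha))
```

On this small, well-regularized problem the iteration settles within a few steps. As noted in the text, convergence is not guaranteed in general, so a practical implementation would add a step-size control or a check that the objective decreases.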

The major stumbling block if we want to apply (16.61) to large problems is 
that the update rule requires the inversion of an m × m matrix. This is costly, 
and effectively precludes efficient exact solutions for problems of size 5000 and 
beyond, due to memory and computational requirements. If we are able to provide 
a low rank approximation of K by 

$$ \tilde{K} = U^\top K_{\mathrm{sub}} U \quad \text{where } U \in \mathbb{R}^{n \times m} \text{ and } K_{\mathrm{sub}} \in \mathbb{R}^{n \times n} \tag{16.62} $$

with n ≪ m, however, we may compute (16.61) much more efficiently. 
Problem 16.15 covers the computation of U given $K_{\mathrm{sub}}$. It follows immediately from 
the Sherman-Woodbury-Morrison formula [207], 

$$ (V + R^\top H R)^{-1} = V^{-1} - V^{-1} R^\top \left( H^{-1} + R V^{-1} R^\top \right)^{-1} R V^{-1}, \tag{16.63} $$

that we obtain the following update rule for α, 

$$ \alpha_{\mathrm{new}} = \left( \mathbf{1} - U^\top \left( K_{\mathrm{sub}}^{-1} + U C U^\top \right)^{-1} U C \right) \left( U^\top K_{\mathrm{sub}} U C \alpha_{\mathrm{old}} - c \right). \tag{16.64} $$

In particular, the number of operations required to solve (16.61) is O(mn² + n³) 
rather than O(m³). 

There are several ways to obtain a good approximation of (16.62). One way is to 
project k(x;,x) on a random subset of dimensions, and express the missing terms 
as a linear combination of the resulting sub-matrix (this is the Nyström method 
proposed by Seeger and Williams [603]). We might also construct a randomized 
sparse greedy algorithm to select the dimensions (see Section 10.2 for more de- 
tails), or resort to a positive diagonal pivoting strategy [169]. 

An approximation of K by its leading principal components, as often done in 
machine learning, is usually undesirable, since the computation of the 
eigensystem would still be costly, and the time required for prediction would still rise with 
the number of observations (since we cannot expect the leading eigenvectors of K 
to contain a significant number of zero coefficients). 

Figure 16.5 Variational Approximation for μ = ν = 0.5. Note that the quality of the 
approximation varies widely, depending on the value of f(x). 


16.4.2 Variational Methods 


In the case of logistic regression, Jaakkola and Jordan [260] compute upper and 
lower bounds on the logistic $(1 + e^{-t})^{-1}$, by exploiting the log-concavity of (3.5): 
A convex function can be bounded from below by its tangent at any point, and 
from above by a quadratic with sufficiently large curvature (provided the 
maximum curvature of the original function is bounded). These bounds are (see also 
exercise 16.13) 

$$ p(y = 1|t) \ge p(y = 1|\nu) \exp\left( \frac{t - \nu}{2} - \lambda(\nu)(t^2 - \nu^2) \right), \tag{16.65} $$
$$ p(y = 1|t) \le \exp\left( \mu t - H(\mu) \right), \tag{16.66} $$

where μ, ν ∈ [0, 1] and $\lambda(\nu) = \frac{p(y = 1|\nu) - 1/2}{2\nu}$. Furthermore, H(μ) is the binary entropy 
function, 

$$ H(\mu) = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu). \tag{16.67} $$

Likewise, bounds for p(y = −1|t) follow directly from p(y = −1|t) = 1 − p(y = 1|t). 
Equations (16.66) and (16.65) can be calculated quite easily, since their logarithms are linear 
or quadratic functions in t. This means that for fixed parameters μ and ν, we 
can optimize an upper and a lower bound on the log posterior using the same 
techniques as in Gaussian process regression (Section 16.3). 

Approximations (16.66) and (16.65) are only tight, however, if ν, μ are chosen 
suitably. Therefore we have to adapt these parameters at every iteration (or after 
each exact solution), for instance by gradient descent. See [198, 196] for details. As 
in the previous section, we could use the Sherman-Woodbury-Morrison formula 
to invert the quadratic terms more efficiently. The implementation is analogous 
to the application of this result in the previous section, hence we will not go into 
further detail. 
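
The two variational bounds on the logistic function can be checked numerically. The sketch below uses the standard Jaakkola-Jordan forms: a lower bound whose exponent is quadratic in t with variational parameter ν, and an upper bound whose exponent is linear in t with parameter μ and the binary entropy H. It evaluates both at μ = ν = 0.5, the setting of Figure 16.5. The function names are illustrative and not from the text.

```python
import math


def sigmoid(t):
    """Logistic function p(y = 1|t)."""
    return 1.0 / (1.0 + math.exp(-t))


def lam(nu):
    """lambda(nu) = (sigmoid(nu) - 1/2) / (2 nu), for nu != 0."""
    return (sigmoid(nu) - 0.5) / (2.0 * nu)


def lower(t, nu):
    """Lower bound: log-quadratic in t, tight at t = +nu and t = -nu."""
    return sigmoid(nu) * math.exp((t - nu) / 2.0 - lam(nu) * (t * t - nu * nu))


def entropy(mu):
    """Binary entropy H(mu) in nats, for 0 < mu < 1."""
    return -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)


def upper(t, mu):
    """Upper bound: log-linear in t, tight where mu = sigmoid(-t)."""
    return math.exp(mu * t - entropy(mu))


mu = nu = 0.5
grid = [0.5 * i for i in range(-12, 13)]          # t from -6 to 6
rows = [(t, lower(t, nu), sigmoid(t), upper(t, mu)) for t in grid]
for t, lo_b, p, up_b in rows[::4]:
    print(f"t = {t:+.1f}: {lo_b:.4f} <= {p:.4f} <= {up_b:.4f}")
```

The printed rows show the behaviour mentioned in the figure caption: both bounds are sharp near their points of contact and loosen quickly away from them, which is why the parameters have to be re-adapted during optimization.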


16.4.3 Approximate Solutions for Gaussian Process Regression 


The approximations of Section 16.4.1 indicate that one of the more efficient ways 
of implementing Gaussian process estimation on large amounts of data is to 
find a low rank approximation⁸ of the matrix K. Such an approximation is very 
much needed in practice, since (16.45) and (16.51) show that exact solutions of 
Gaussian Processes can be hard to come by. Even if α is computed beforehand 
(see Table 16.1 for the scaling behavior), prediction of the mean at a new location 
still requires O(m) operations. In particular, memory requirements are O(m²) to 
store K, and CPU time for matrix inversions, as are typically required for second 
order methods, scales with O(m³). 

Let us limit ourselves to an approximation of the MAP solution. One of the crite- 
ria to impose is that the posterior probability at the approximate solution be close 
to the maximum of the posterior probability. Note that this requirement is different 
from the requirement of closeness in the approximation itself, as represented for 
instance by the expansion coefficients (the latter requirement was used by Gibbs 
and MacKay [197]). Proximity in the coefficients, however, is not what we want, 
since it does not take into account the importance of the individual variables. For 
instance, it is not invariant under transformations of scale in the parameters. 

For the remainder of the current section, we consider only additive normal 
noise. Here, the log posterior takes a quadratic form, given by (16.46). The follow- 
ing theorem, which uses an idea from [197], gives a bound on the approximation 
quality of minima of quadratic forms and is thus applicable to (16.46). 


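Theorem 16.5 (Approximation Bounds for Quadratic Forms [503]) Denote by 
$K \in \mathbb{R}^{m \times m}$ a symmetric positive definite matrix, $y, \alpha \in \mathbb{R}^m$, and define the two quadratic forms 

$$ Q(\alpha) := -y^\top K \alpha + \tfrac{1}{2} \alpha^\top (\sigma^2 K + K^\top K) \alpha, \tag{16.68} $$
$$ Q^*(\alpha) := -y^\top \alpha + \tfrac{1}{2} \alpha^\top (\sigma^2 \mathbf{1} + K) \alpha. \tag{16.69} $$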


8. Tresp [546] devised an efficient way of estimating f(x) if the test set is known at the 
time of training. He proceeds by projecting the estimators on the subspace spanned by the 
functions $k(\tilde{x}_i, \cdot)$, where $\tilde{x}_i$ are the training data. Likewise, Csató and Opper [128] design 
an iterative algorithm that performs gradient descent on partial posterior distributions and 
simultaneously projects the estimates onto a subspace. 

Suppose Q and Q* have minima $Q_{\min}$ and $Q^*_{\min}$. Then for all $\alpha, \alpha^* \in \mathbb{R}^m$ we have 

$$ Q(\alpha) \ge Q_{\min} \ge -\tfrac{1}{2} \|y\|^2 - \sigma^2 Q^*(\alpha^*), \tag{16.70} $$
$$ Q^*(\alpha^*) \ge Q^*_{\min} \ge \sigma^{-2} \left( -\tfrac{1}{2} \|y\|^2 - Q(\alpha) \right), \tag{16.71} $$

with equalities throughout when $Q(\alpha) = Q_{\min}$ and $Q^*(\alpha^*) = Q^*_{\min}$. 
Hence, by minimizing Q* in addition to Q, we can bound Q’s closeness to the 
optimum and vice versa. 


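Proof The minimum of $Q(\alpha)$ is obtained for $\alpha_{\mathrm{opt}} = (K + \sigma^2 \mathbf{1})^{-1} y$ (which also
minimizes $Q^*$),⁹ hence

$$Q_{\min} = -\tfrac{1}{2} y^\top K (K + \sigma^2 \mathbf{1})^{-1} y \quad\text{and}\quad Q^*_{\min} = -\tfrac{1}{2} y^\top (K + \sigma^2 \mathbf{1})^{-1} y. \tag{16.72}$$

This allows us to combine $Q_{\min}$ and $Q^*_{\min}$ to $Q_{\min} + \sigma^2 Q^*_{\min} = -\tfrac{1}{2}\|y\|^2$. Since by
definition $Q(\alpha) \ge Q_{\min}$ for all $\alpha$ (and likewise $Q^*(\alpha^*) \ge Q^*_{\min}$ for all $\alpha^*$), we may
solve $Q_{\min} + \sigma^2 Q^*_{\min}$ for either $Q$ or $Q^*$ to obtain lower bounds for each of the two
quantities. This proves (16.70) and (16.71). ∎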
Equation (16.70) is useful for computing an approximation to the MAP solution
(the objective function is identical to $Q(\alpha)$, ignoring constant terms independent
of $\alpha$), whereas (16.71) can be used to obtain error bars on the estimate. To see this,
note that in calculating the variance (16.49), the expensive quantity to compute is
$-k^\top (K + \sigma^2 \mathbf{1})^{-1} k$. This can, however, be found as

$$-k^\top (K + \sigma^2 \mathbf{1})^{-1} k = 2 \min_{\alpha \in \mathbb{R}^m} \left[ -k^\top \alpha + \tfrac{1}{2} \alpha^\top (\sigma^2 \mathbf{1} + K) \alpha \right]. \tag{16.73}$$

A close look reveals that the expression inside the brackets is $Q^*(\alpha)$
with $y = k$ (see (16.69)). Consequently, an approximate minimizer of (16.73) gives
an upper bound on the error bars, and lower bounds can be obtained from (16.71).
In practice, we use the relative discrepancy between the upper and lower bounds,

$$\mathrm{gap}(\alpha, \alpha^*) := \frac{2\left(Q(\alpha) + \sigma^2 Q^*(\alpha^*) + \tfrac{1}{2}\|y\|^2\right)}{-Q(\alpha) + \sigma^2 Q^*(\alpha^*) + \tfrac{1}{2}\|y\|^2}, \tag{16.74}$$

to determine how much further the approximation has to proceed.
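
The bounds of Theorem 16.5 and the gap (16.74) can be checked numerically. The following is a minimal sketch on a toy problem; the data, the kernel width, and the function names `Q`, `Q_star`, and `gap` are assumptions made for illustration only.

```python
import numpy as np

def Q(alpha, K, y, sigma2):
    """Quadratic form (16.68)."""
    return -y @ K @ alpha + 0.5 * alpha @ (sigma2 * K + K.T @ K) @ alpha

def Q_star(alpha, K, y, sigma2):
    """Quadratic form (16.69)."""
    return -y @ alpha + 0.5 * alpha @ (sigma2 * alpha + K @ alpha)

def gap(alpha, alpha_star, K, y, sigma2):
    """Relative gap (16.74) between the upper and lower bound on Q_min."""
    q = Q(alpha, K, y, sigma2)
    shifted = sigma2 * Q_star(alpha_star, K, y, sigma2) + 0.5 * y @ y
    return 2.0 * (q + shifted) / (-q + shifted)

# Toy problem (assumed data): 30 points on a line, narrow Gaussian kernel.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 29.0, 30)
K = np.exp(-2.0 * (x[:, None] - x[None, :]) ** 2)
y = np.sin(0.3 * x) + 0.1 * rng.standard_normal(30)
sigma2 = 0.1

# Exact minimizer shared by Q and Q*.
alpha_opt = np.linalg.solve(K + sigma2 * np.eye(30), y)
q_min = Q(alpha_opt, K, y, sigma2)
q_star_min = Q_star(alpha_opt, K, y, sigma2)

# Identity used in the proof: Q_min + sigma^2 Q*_min = -||y||^2 / 2.
identity_residual = q_min + sigma2 * q_star_min + 0.5 * y @ y

# A deliberately suboptimal point gives a valid sandwich around Q_min.
alpha_rough = 0.9 * alpha_opt
upper_bound = Q(alpha_rough, K, y, sigma2)
lower_bound = -0.5 * y @ y - sigma2 * Q_star(alpha_rough, K, y, sigma2)
gap_opt = gap(alpha_opt, alpha_opt, K, y, sigma2)
gap_rough = gap(alpha_rough, alpha_rough, K, y, sigma2)
```

The gap vanishes at the exact minimizer and is positive otherwise, which is what makes it usable as a stopping criterion without knowing $Q_{\min}$.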
9. If $K$ does not have full rank, $Q(\alpha)$ still attains its minimum value for $\alpha_{\mathrm{opt}}$. There will
then be additional $\alpha$ that minimize $Q(\alpha)$, however.


16.4.4 Solutions on Subspaces


The central idea of the algorithm below is that improvements in speed can be
achieved by a reduction in the number of free variables. Denote by $P \in \mathbb{R}^{m \times n}$ with
$m \ge n$ and $m, n \in \mathbb{N}$ an extension matrix (in other words, $P^\top$ is a projection), with
$P^\top P = \mathbf{1}$. We make the ansatz

$$\alpha_P := P\beta \quad\text{where } \beta \in \mathbb{R}^n, \tag{16.75}$$

and find solutions $\beta$ such that $Q(\alpha_P)$ (or $Q^*(\alpha_P)$) is minimized. The solution is

$$\beta_{\mathrm{opt}} = \left(P^\top (\sigma^2 K + K^\top K) P\right)^{-1} P^\top K^\top y. \tag{16.76}$$


Table 16.1 Computational Cost of Various Optimization Methods. Note that $n < m$, and
that different values of $n$ are used in Conjugate Gradient, Sparse Decomposition, and
Sparse Greedy Approximation methods: $n_{\mathrm{CG}} < n_{\mathrm{SD}} < n_{\mathrm{SGA}}$, since the search spaces are
progressively more restricted. Near-optimal results are obtained when $\kappa = 60$.

                        | Exact Solution | Conjugate Gradient | Sparse Decomposition    | Sparse Greedy Approximation
Initialization          | $O(m^3)$       | $O(nm^2)$          | $O(n^2 m)$              | $O(\kappa n^2 m)$
(= Training)            |                |                    |                         |
Prediction: Mean        | $O(m)$         | $O(m)$             | $O(n)$                  | $O(n)$
Prediction: Error Bars  | $O(m^2)$       | $O(nm^2)$          | $O(n^2 m)$ or $O(n^2)$  | $O(\kappa n^2 m)$ or $O(n^2)$



If $P$ is of rank $m$, this is also the solution of (16.46) (the minimum negative log
posterior for all $\alpha \in \mathbb{R}^m$). In all other cases, however, it is an approximation.

For a given $P \in \mathbb{R}^{m \times n}$, let us analyze the computational cost involved in computing
(16.76). We need $O(nm)$ operations to evaluate $P^\top K y$, $O(n^2 m)$ operations for
$(KP)^\top (KP)$, and $O(n^3)$ operations for the inversion of an $n \times n$ matrix. This brings
the total cost to $O(n^2 m)$. Predictions require $k^\top \alpha$, which entails $O(n)$ operations.
Likewise, we may use $P$ to minimize $Q^*(P\beta^*)$, which is needed to upper-bound
the log posterior. The latter costs no more than $O(n^3)$.

To compute the posterior variance, we have to approximately minimize (16.73),
which can be done for $\alpha = P\beta$ at cost $O(n^3)$. If we compute $(P^\top K P)^{-1}$ beforehand,
the cost becomes $O(n^2)$, and likewise for upper bounds. In addition to this, we
have to minimize $-k^\top K P \beta + \tfrac{1}{2} \beta^\top P^\top (\sigma^2 K + K^\top K) P \beta$, which again costs $O(n^2 m)$
(once the inverse matrices have been computed, however, we may also use them to
compute error bars at different locations, thus limiting the cost to $O(n^2)$). Accurate
lower bounds on the error bars are not especially crucial, since a bad estimate leads
at worst to overly conservative confidence intervals, and has no further negative
effect. Finally, note that we need only compute and store $KP$ (that is, the $m \times n$
sub-matrix of $K$) and not $K$ itself. Table 16.1 summarizes the scaling behavior of
several optimization algorithms.
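
As an illustration of (16.76), the sketch below computes the subspace solution for the special case where $P$ is a collection of unit vectors, so that $KP$ is simply a column selection of $K$. The function name `subspace_solution`, the toy kernel, and the chosen index sets are assumptions for illustration, not part of the original text.

```python
import numpy as np

def subspace_solution(K, y, sigma2, idx):
    """Solve (16.76) when P selects the columns listed in idx."""
    idx = np.asarray(list(idx))
    KP = K[:, idx]                                    # m x n block K P
    # P^T (sigma^2 K + K^T K) P = sigma^2 K[idx, idx] + (K P)^T (K P)
    H = sigma2 * K[np.ix_(idx, idx)] + KP.T @ KP
    beta = np.linalg.solve(H, KP.T @ y)               # P^T K^T y = (K P)^T y
    alpha = np.zeros(K.shape[0])
    alpha[idx] = beta                                 # alpha_P = P beta
    return beta, alpha

def Q(alpha, K, y, sigma2):
    """Quadratic form (16.68)."""
    return -y @ K @ alpha + 0.5 * alpha @ (sigma2 * K + K.T @ K) @ alpha

# Toy problem (assumed data).
x = np.arange(20.0)
K = np.exp(-2.0 * (x[:, None] - x[None, :]) ** 2)
y = np.cos(0.4 * x)
sigma2 = 0.1

_, alpha_5 = subspace_solution(K, y, sigma2, [0, 4, 8, 12, 16])
_, alpha_10 = subspace_solution(K, y, sigma2, range(0, 20, 2))
_, alpha_full = subspace_solution(K, y, sigma2, range(20))
alpha_exact = np.linalg.solve(K + sigma2 * np.eye(20), y)

q_5, q_10, q_full = (Q(a, K, y, sigma2) for a in (alpha_5, alpha_10, alpha_full))
```

Only the selected columns `K[:, idx]` are needed, which reflects the remark that $KP$ rather than $K$ has to be stored. With all indices selected, the result coincides with the exact solution.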

This leads us to the question of how to choose $P$ for optimum efficiency. Possibilities
include using the principal components of $K$ [602], performing conjugate
gradient descent to minimize $Q$ [197], performing symmetric diagonal pivoting
[169], or applying a sparse greedy approximation to $K$ directly [513]. Yet these
methods have the disadvantage that they either do not take the specific form of
$y$ into account [602, 513, 169], or lead to expansions that cost $O(m)$ for prediction,
and require computation and storage of the full matrix [602, 197].

By contrast to these methods, we use a data adaptive version of a sparse greedy
approximation algorithm. We may then only consider matrices $P$ that are a collection
of unit vectors $e_i$ (here $(e_i)_j = \delta_{ij}$), since these only select a number of rows of
$K$ equal to the rank of $P$. The details follow the algorithmic template described in
Section 6.5.3.


▪ First, for $n = 1$, we choose $P = e_i$ such that $Q(P\beta)$ is minimal. In this case we
could permit ourselves to consider all possible indices $i \in \{1, \ldots, m\}$ and find the
best one by trying all of them.

▪ Next, assume that we found a good solution $P\beta$, where $P$ contains $n$ columns.
In order to improve this solution, we expand $P$ into the matrix $P_{\mathrm{new}} := [P_{\mathrm{old}}, e_i] \in
\mathbb{R}^{m \times (n+1)}$ and seek the best $e_i$ such that $P_{\mathrm{new}}$ minimizes $\min_\beta Q(P_{\mathrm{new}}\beta)$.


Note that this method is very similar to Matching Pursuit [342] and to iterative
reduced set Support Vector algorithms (see Section 18.5 and [474]), with the difference
that the target to be approximated (the full solution $\alpha$) is only given implicitly
via $Q(\alpha)$.

Recently Zhang [613] proved lower bounds on the rate of sparse approximation
schemes. In particular, he shows that most subspace projection algorithms enjoy
at least an $O(n^{-1})$ rate of convergence. See also [614] for details on further greedy
approximation methods.
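
The greedy expansion of $P$ described above can be sketched as follows. This is a deliberately naive version under stated assumptions: it re-solves the small system for every candidate instead of using rank-1 updates, it draws the candidates from a random subset whose size `kappa` is a free parameter, and the names `min_q_on_subset` and `sparse_greedy` as well as the toy data are hypothetical.

```python
import numpy as np

def min_q_on_subset(K, y, sigma2, idx):
    """Minimal value of Q(P beta) when P selects the columns in idx."""
    KP = K[:, idx]
    H = sigma2 * K[np.ix_(idx, idx)] + KP.T @ KP
    g = KP.T @ y
    beta = np.linalg.solve(H, g)
    return -0.5 * g @ beta          # Q at beta_opt equals -g^T beta / 2

def sparse_greedy(K, y, sigma2, n_basis, kappa=59, seed=0):
    """Greedily pick n_basis indices, testing kappa random candidates per step."""
    rng = np.random.default_rng(seed)
    selected, remaining, history = [], list(range(K.shape[0])), []
    for _ in range(n_basis):
        size = min(kappa, len(remaining))
        candidates = rng.choice(remaining, size=size, replace=False)
        scores = [min_q_on_subset(K, y, sigma2, selected + [int(i)])
                  for i in candidates]
        best = int(candidates[int(np.argmin(scores))])
        selected.append(best)
        remaining.remove(best)
        history.append(min(scores))   # best Q value after this step
    return selected, history

# Toy problem (assumed data).
x = np.linspace(0.0, 10.0, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
y = np.sin(x)
sigma2 = 0.1

selected, history = sparse_greedy(K, y, sigma2, n_basis=6, kappa=10)
alpha_opt = np.linalg.solve(K + sigma2 * np.eye(40), y)
q_min = -0.5 * y @ K @ alpha_opt
```

Because each step enlarges the search space, the recorded values of $Q$ cannot increase, and they remain above the minimum over all of $\mathbb{R}^m$.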


16.4.5 Implementation Issues 


Performing a full search over all possible $n + 1$ of $m$ indices is excessively costly.
Even a full search over all $m - n$ remaining indices to select the next basis function
can be prohibitively expensive. Here Theorem 6.33 comes to our aid: it states that
with high probability, a small subset of size $\kappa = 59$, chosen at random, guarantees
near optimal performance. Hence, if we are satisfied with finding a relatively good
index rather than the best index, we may resort to selecting only a random subset.

It is now crucial to obtain the values of $Q(P\beta_{\mathrm{opt}})$ cheaply (with $P = [P_{\mathrm{old}}, e_i]$),
assuming that we found $P_{\mathrm{old}}$ previously. From (16.76) we can see that we need
only do a rank-1 update on the inverse. We now show that this can be obtained
in $O(mn)$ operations, provided the inverse of the smaller subsystem is known.
Expressing the relevant terms using $P_{\mathrm{old}}$ and $k_i$ (where $k_i := K e_i$ is the $i$th column of $K$), we obtain

$$P^\top K^\top y = [P_{\mathrm{old}}, e_i]^\top K^\top y = \left(P_{\mathrm{old}}^\top K^\top y,\; k_i^\top y\right), \tag{16.77}$$

$$P^\top (K^\top K + \sigma^2 K) P = \begin{bmatrix} P_{\mathrm{old}}^\top (K^\top K + \sigma^2 K) P_{\mathrm{old}} & P_{\mathrm{old}}^\top (K^\top + \sigma^2 \mathbf{1}) k_i \\ k_i^\top (K + \sigma^2 \mathbf{1}) P_{\mathrm{old}} & k_i^\top k_i + \sigma^2 K_{ii} \end{bmatrix}. \tag{16.78}$$

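Algorithm 16.1 Sparse Greedy Quadratic Minimization.

Require: Training data $X = \{x_1, \ldots, x_m\}$, Targets $y$, Noise $\sigma^2$, Precision $\epsilon$, corresponding
    quadratic forms $Q$ and $Q^*$.
Initialize index sets $I, I^* = \{1, \ldots, m\}$; $S, S^* = \emptyset$.
repeat
    Choose $M \subseteq I$, $M^* \subseteq I^*$.
    Find $\arg\min_{i \in M} Q([P, e_i]\beta_{\mathrm{opt}})$, $\arg\min_{i^* \in M^*} Q^*([P^*, e_{i^*}]\beta^*_{\mathrm{opt}})$.
    Move $i$ from $I$ to $S$, $i^*$ from $I^*$ to $S^*$.
    Set $P := [P, e_i]$, $P^* := [P^*, e_{i^*}]$.
until $Q(P\beta_{\mathrm{opt}}) + \sigma^2 Q^*(P^*\beta^*_{\mathrm{opt}}) + \tfrac{1}{2}\|y\|^2 \le \tfrac{\epsilon}{2}\left(|Q(P\beta_{\mathrm{opt}})| + |\sigma^2 Q^*(P^*\beta^*_{\mathrm{opt}}) + \tfrac{1}{2}\|y\|^2|\right)$
Output: Set of indices $S$, $\beta_{\mathrm{opt}}$, $(P^\top K P)^{-1}$, and $(P^\top (K^\top K + \sigma^2 K) P)^{-1}$.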


Thus computation of the terms costs only $O(nm)$ once we know $P_{\mathrm{old}}$. Furthermore,
we can write the inverse of a strictly positive definite matrix as

$$\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + (A^{-1}B)\,\gamma\,(A^{-1}B)^\top & -(A^{-1}B)\,\gamma \\ -\gamma\,(A^{-1}B)^\top & \gamma \end{bmatrix}, \tag{16.79}$$

where $\gamma := (C - B^\top A^{-1} B)^{-1}$. Hence, inversion of $P^\top (K^\top K + \sigma^2 K) P$ costs only
$O(n^2)$. Thus, to find the matrix $P$ of size $m \times n$ takes $O(\kappa n^2 m)$ time. For the error
bars, $(P^\top K P)^{-1}$ is generally a good starting value for the minimization of (16.73),
so the typical cost for (16.73) is $O(\tau m n)$ for some $\tau < n$, rather than $O(mn^2)$.

If additional numerical stability is required, we might want to replace (16.79) by
a rank-1 update rule for Cholesky decompositions of the corresponding positive
definite matrix. Furthermore, we may want to add the kernel function chosen by
positive diagonal pivoting [169] to the selected subset, in order to ensure that the
$n \times n$ submatrix remains invertible. See numerical mathematics textbooks, such as
[247], for more detail on update rules.
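
The update (16.79) for the case where a single row and column are appended ($B$ a vector, $C$ a scalar) might be sketched as follows. The function name `grow_inverse` and the random test matrix are assumptions for illustration.

```python
import numpy as np

def grow_inverse(A_inv, b, c):
    """Inverse of [[A, b], [b^T, c]] from A^{-1}, following (16.79).

    A_inv is the known n x n inverse, b the new column (length n),
    and c the new diagonal entry. Cost is O(n^2).
    """
    n = A_inv.shape[0]
    u = A_inv @ b                       # A^{-1} B
    gamma = 1.0 / (c - b @ u)           # inverse Schur complement
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = A_inv + gamma * np.outer(u, u)
    out[:n, n] = -gamma * u
    out[n, :n] = -gamma * u
    out[n, n] = gamma
    return out

# Check against a direct inverse on a random positive definite matrix.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
M = B @ B.T + 6.0 * np.eye(6)
M_inv_updated = grow_inverse(np.linalg.inv(M[:5, :5]), M[:5, 5], M[5, 5])
M_inv_direct = np.linalg.inv(M)
```

In the greedy algorithm, `A_inv` would play the role of the inverse for $P_{\mathrm{old}}$, and `b`, `c` the new off-diagonal and diagonal blocks from (16.78).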


16.4.6 Hardness and Approximation Results 


It is worthwhile to study the theoretical guarantees on the performance of the
algorithm (as described in Algorithm 16.1). It turns out that our technique closely
resembles a Sparse Linear Approximation problem studied by Natarajan [381]:

Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\epsilon > 0$, find $x \in \mathbb{R}^n$ with minimal number of nonzero
entries such that $\|Ax - b\|_2 \le \epsilon$. If we define

$$A := (\sigma^2 K + K^\top K)^{1/2} \quad\text{and}\quad b := A^{-1} K y, \tag{16.80}$$

we may write

$$Q(\alpha) = \tfrac{1}{2}\|b - A\alpha\|^2 + c, \tag{16.81}$$

where $c$ is a constant independent of $\alpha$. Thus the problem of sparse approximate
minimization of $Q(\alpha)$ is a special case of Natarajan's problem (where the matrix
$A$ is square, and strictly positive definite). In addition, the algorithm considered
in [381] involves sequentially choosing columns of $A$ to maximally decrease $\|Ax - b\|$.
This is equivalent to the algorithm described above and we may apply the
following result to our sparse greedy Gaussian process algorithm.


Theorem 16.6 (Natarajan, 1995 [381]) A sparse greedy algorithm to approximately
solve the problem

$$\text{minimize } \|y - Ax\| \tag{16.82}$$

needs at most

$$n \le 18\, n^*(\epsilon/2)\, \|\tilde{A}^{+}\|_2^2\, \ln \frac{\|y\|_2}{\epsilon} \tag{16.83}$$

non-zero components, where $n^*(\epsilon)$ is the minimum number of nonzero components in
vectors $x$ for which $\|y - Ax\| \le \epsilon$, and $\tilde{A}$ is the matrix obtained from $A$ by normalizing
its columns to unit length, with pseudoinverse $\tilde{A}^{+}$.


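Corollary 16.7 (Approximation Rate for Gaussian Processes) Algorithm 16.1 satisfies
$Q(\alpha) \le Q(\alpha_{\mathrm{opt}}) + \epsilon$ when $\alpha$ has

$$n \le \frac{18\, n^*(\epsilon/4)}{\lambda^2}\, \ln \frac{\|A^{-1} K y\|}{\sqrt{2\epsilon}} \tag{16.84}$$

non-zero components, where $n^*(\epsilon)$ is the minimum number of nonzero components in
vectors $\alpha$ for which $Q(\alpha) \le Q(\alpha_{\mathrm{opt}}) + \epsilon$, $A = (\sigma^2 K + K^\top K)^{1/2}$, and $\lambda$ is the smallest
magnitude of the singular values of $\tilde{A}$, the matrix obtained by normalizing the columns of
$A$.
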


Moreover, we can also show NP-hardness of sparse approximation of Gaussian 
process regression. The following theorem holds: 


Theorem 16.8 (NP-Hardness of Approximate GP Regression) There exist kernels
$K$ and labels $y$ such that the problem of finding the minimal set of indices to minimize a
corresponding quadratic function $Q(\alpha)$ with precision $\epsilon$ is NP-hard.


Proof We use the hardness result [381, Theorem 1] for Natarajan's quadratic
approximation problem in terms of $A$ and $b$. More specifically, we have to proceed
in the opposite direction to (16.80) and (16.81) and show that for every $A$ and $b$,
there exist $K$ and $y$ for an equivalent optimization problem.

Since $\|Ax - b\|^2 = x^\top (A^\top A) x - 2 (b^\top A) x + \|b\|^2$, the value of $A$ enters only via
$A^\top A$, which means that we have to find $K$ in (16.68) such that

$$A^\top A = K^\top K + \sigma^2 K. \tag{16.85}$$

We can check that it is possible to find a suitable positive definite $K$ for any $A$,
by using identical eigensystems for $A^\top A$ and $K$, and subsequently solving the
equations $a_i = \lambda_i^2 + \sigma^2 \lambda_i$ for the respective eigenvalues $a_i$ and $\lambda_i$ of $A^\top A$ and $K$.
Furthermore, we have to satisfy

$$y^\top K = b^\top A. \tag{16.86}$$

To see this, recall that $b^\top A$ is a linear combination of the nonzero eigenvectors of
$A^\top A$; and since $K$ has the same rank and image as $A^\top A$, the vector $b^\top A$ can also
be represented by $y^\top K$. Thus for every $A, b$ there exists an equivalent $Q$, which
proves NP-hardness by reduction. ∎


Figure 16.6 Speed of Convergence. We plot the size of the gap between upper and lower
bound of the log posterior ($\mathrm{gap}(\alpha, \alpha^*)$), for the first 4000 samples from the Abalone dataset
($\sigma^2 = 0.1$ and $2\omega^2 = 10$). From top to bottom: Subsets of size 1, 2, 5, 10, 20, 50, 100, 200. The
results were averaged over 10 runs. The relative variance of the gap size is less than 10%.
We can see that subsets of size 50 and above ensure rapid convergence.


This shows that the sparse greedy algorithm is an efficient approximate solution 
of an NP-hard problem. 
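
Both directions of the correspondence between $Q$ and Natarajan's problem can be checked numerically. The sketch below verifies (16.81) with $c = -\tfrac{1}{2}\|b\|^2$ and recovers the eigenvalues of $K$ from those of $A^\top A$ by solving $a_i = \lambda_i^2 + \sigma^2 \lambda_i$; the toy kernel matrix and all variable names are assumptions for illustration.

```python
import numpy as np

# Toy kernel matrix (assumed data), symmetric positive definite.
x = np.arange(15.0)
K = np.exp(-2.0 * (x[:, None] - x[None, :]) ** 2)
y = np.sin(0.5 * x)
sigma2 = 0.1

# Forward direction (16.80): A is the symmetric square root of sigma^2 K + K^T K.
H = sigma2 * K + K.T @ K
a_evals, evecs = np.linalg.eigh(H)
A = (evecs * np.sqrt(a_evals)) @ evecs.T
b = np.linalg.solve(A, K @ y)
c = -0.5 * b @ b                       # constant in (16.81)

rng = np.random.default_rng(0)
alpha = rng.standard_normal(15)
q_direct = -y @ K @ alpha + 0.5 * alpha @ H @ alpha
q_least_squares = 0.5 * np.sum((b - A @ alpha) ** 2) + c

# Reverse direction (16.85): positive root of lambda^2 + sigma^2 lambda = a.
lam = 0.5 * (-sigma2 + np.sqrt(sigma2 ** 2 + 4.0 * a_evals))
k_evals = np.linalg.eigvalsh(K)
```

The positive root is the relevant one, since $K$ is required to be positive definite.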


16.4.7 Experimental Evidence 


We conclude this section with a brief experimental demonstration of the efficiency 
of sparse greedy approximation methods, using the Abalone dataset. Specifically, 
we used Gaussian covariance kernels, and we split the data into 4000 training and 
177 test examples to assess training speed (to assess generalization performance, a 
3000 training and 1177 test set split was used). 
Table 16.2 Performance of sparse greedy approximation vs. explicit solution of the full
learning problem. In these experiments, the Abalone dataset was split into 3000 training
and 1177 test samples. To obtain more reliable estimates, the algorithm was run over 10
random splits of the whole dataset.

                             | Generalization Error | Log Posterior
Optimal Solution             | 1.782 ± 0.33         | $-1.571 \cdot 10^5\,(1 \pm 0.005)$
Sparse Greedy Approximation  | 1.785 ± 0.32         | $-1.572 \cdot 10^5\,(1 \pm 0.005)$


Table 16.3 Number of basis functions needed to minimize the log posterior on the
Abalone dataset (4000 training samples), for various kernel widths $\omega$. Also given is the
number of basis functions required to approximate $k^\top (K + \sigma^2 \mathbf{1})^{-1} k$, which is needed to
compute the error bars. Results were averaged over the 177 test samples.


For the optimal parameters ($2\sigma^2 = 0.1$ and $2\omega^2 = 10$, chosen after [513]), the
average test error of the sparse greedy approximation trained until $\mathrm{gap}(\alpha, \alpha^*) <
0.025$ is indistinguishable from the corresponding error obtained by an exact
solution of the full system. The same applies for the log posterior. See Table 16.2 for
details. Consequently for all practical purposes, full inversion of the covariance
matrix and the sparse greedy approximation have comparable generalization
performance.

A more important quantity in practice is the number of basis functions needed to
minimize the log posterior to a sufficiently high precision. Table 16.3 shows this
number for a precision of $\mathrm{gap}(\alpha, \alpha^*) < 0.025$, and its variation as a function of the
kernel width $\omega$; the latter dependency is observed since the number of kernels
determines time and memory needed for prediction and training. In all cases,
fewer than 10% of the kernel functions suffice to find a good minimizer of the log
posterior; fewer than 2% are sufficient to compute the error bars. This is a significant
improvement over a direct minimization approach.

A similar result can be obtained on larger datasets. To illustrate, we generated
a synthetic data set of size 10,000 in $\mathbb{R}^{20}$ by adding normal noise with variance
$\sigma^2 = 0.1$ to a function consisting of 200 randomly chosen Gaussians of width
$2\omega^2 = 40$ and with normally distributed expansion coefficients and centers.

To avoid trivial sparse expansions, we deliberately used an inadequate Gaussian
process prior (but correct noise level) consisting of Gaussians with width $2\omega^2 =
10$. After 500 iterations (thus, after using 5% of all basis functions), the size of
$\mathrm{gap}(\alpha, \alpha^*)$ was less than 0.023. This demonstrates the feasibility of the sparse
greedy approach on larger datasets.
16.5 Laplacian Processes


All the prior distributions considered so far are data independent priors; in other
words, $p(f)$ does not depend on $X$ at all. This may not always be the most desirable
choice, thus we now consider data dependent prior distributions, $p(f|X)$. This goes
slightly beyond the commonly used concepts in Bayesian estimation.

Before we go into the technical details, let us give some motivation as to why
the complexity of an estimate can depend on the locations where data occurs, since
we are effectively updating our prior assumptions about $f$ after observing the data
placement. Note that we do not modify our prior assumptions based on the targets
$y_i$, but rather as a result of the distribution of patterns $x_i$: Different input distribution
densities might for instance correspond to different assumptions regarding
the smoothness of the function class to be estimated. For example, it might
be advisable to favor smooth functions in areas where data are scarce, and allow
more complicated functions where observations abound. We might not care about
smoothness at all in regions where there is little or no chance of patterns occurring:
In the problem of handwritten digit recognition, we do not (and should not) care
about the behavior of the estimator on inputs $x$ looking like faces.

Finally, we might assume a specific distribution of the coefficients of a function
via a data-dependent function expansion; in other words, an expansion of $f$ into
the span of $\Phi := \{\phi_1, \ldots, \phi_M\}$, where $\phi_i$ are functions of the observed data $X$ and
of $x$. We focus henceforth on the case where $M = m$ and $\phi_i(x) := k(x_i, x)$.

The specific benefit of this strategy is that it provides us with a correspondence
between linear programming regularization (Section 4.9.2) and weight decay
regularizers (Section 4.9.1), and Bayesian priors over function spaces, by analogy to
regularization in Reproducing Kernel Hilbert Spaces and Gaussian Processes.¹⁰


16.5.1 Data Dependent Priors 

Recall the reasoning of Section 16.1.3. We obtained (16.11) under the assumption
that $X$ and $f$ are independent random variables. In the following, we repeat the
derivation without this restriction, and obtain

$$p(Y|f, X)\, p(f|X) = p(Y, f|X), \tag{16.87}$$

and likewise,

$$p(f|X, Y)\, p(Y|X) = p(Y, f|X). \tag{16.88}$$

Combining these two equations provides us with a modified version of Bayes'
rule, which after solving for $p(f|Y, X)$, reads

$$p(Y|f, X)\, p(f|X) = p(f|X, Y)\, p(Y|X), \tag{16.89}$$



10. We thank Carl Magnus Rasmussen for discussions and suggestions. 
and thus,

$$p(f|X, Y) = \frac{p(Y|f, X)\, p(f|X)}{p(Y|X)}. \tag{16.90}$$

Since $p(Y|X)$ is independent of $f$, we may treat $p(Y|X)$ as a mere normalization
factor, and focus on $p(Y|f, X)\, p(f|X)$ for inference purposes. Let us now study a
specific class of such priors, which are best formulated in coefficient space. We
have

$$p(f|X) = \frac{1}{Z} \exp\left(-\sum_{i=1}^m \gamma(\alpha_i)\right), \quad\text{where } f(x) = \sum_{i=1}^m \alpha_i k(x_i, x). \tag{16.91}$$

Here $Z$ is the corresponding normalization term and $x_i \in X$. Examples of priors
that depend on the locations $x_i$ include

$$\gamma(\alpha) = 1 - e^{-p|\alpha|} \text{ with } p > 0 \quad\text{(feature selection prior)}, \tag{16.92}$$

$$\gamma(\alpha) = \alpha^2 \quad\text{(weight decay prior)}, \tag{16.93}$$

$$\gamma(\alpha) = |\alpha| \quad\text{(Laplacian prior)}. \tag{16.94}$$



The prior given by (16.92) was introduced in [70, 187] and is concave. While the
latter characteristic is unfavorable in general, since the corresponding optimization
problem exhibits many local minima, the regularized risk functional becomes
strictly concave if we choose linear loss functions (such as the $L_1$ loss or the soft
margin). According to Theorem 6.12, this means that the optimum occurs at one
of the extreme points, which makes optimization more feasible.

Eq. (16.93) describes the popular weight decay prior used in Bayesian Neural
Networks [338, 382, 383]. It assumes that the coefficients are independently normally
distributed. We relax the assumption of a common normal distribution in
Section 16.6 and introduce individual (hyper)parameters $s_i$. The resulting prior,

$$p(f|X, s) = (2\pi)^{-m/2} \left(\prod_{i=1}^m s_i\right)^{1/2} \exp\left(-\frac{1}{2} \sum_{i=1}^m s_i \alpha_i^2\right), \tag{16.95}$$

leads to the construction of the Relevance Vector Machine [539] and very sparse
function expansions.

Finally, the assumption underlying the Laplacian prior (16.94) is that only very
few basis functions will be nonzero. The specific form of the prior is why we will
call such estimators Laplacian Processes. This prior has two significant advantages
over (16.92): It leads to convex optimization problems, and the integral $\int p(\alpha)\, d\alpha$ is
finite and thus allows normalization (this is not the case for (16.92), which is why
we call the latter an improper prior).

The Laplacian prior corresponds to the regularization functional employed in 
sparse coding approaches, such as wavelet dictionaries [104], coding of natural 
images [389], independent component analysis [327], and linear programming 
regression [502, 517]. 
In the following, we focus on (16.94). It is straightforward to see that the MAP
estimate can be obtained by minimizing the negative log posterior, which is given
(up to constant terms) by

$$-\sum_{i=1}^m \ln p(y_i | f(x_i), x_i) + \sum_{i=1}^m |\alpha_i|. \tag{16.96}$$

Depending on $\ln p(y_i | f(x_i), x_i)$, we may formulate (16.96) as a linear or quadratic
program.


16.5.2 Samples from the Prior 


In order to illustrate our reasoning, and to show that such priors correspond
to useful classes of functions, we generate samples from the prior distribution.
As in Gaussian processes, smooth kernels $k$ correspond to smooth priors. This
is not surprising: As we show in the next section (Theorem 16.9), there exists a
corresponding Gaussian process for every kernel $k$ and every distribution $p(x)$.

The obvious advantage, however, is that we do not have to worry about Mercer's
condition for $k$ but can take any arbitrary function $k(x, x')$ to generate a Laplacian
process. We draw samples from the following three kernels,

$$k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \quad\text{Gaussian RBF kernel}, \tag{16.97}$$

$$k(x, x') = e^{-\frac{\|x - x'\|}{\sigma}} \quad\text{Laplacian RBF kernel}, \tag{16.98}$$

$$k(x, x') = \tanh(\kappa \langle x, x' \rangle + \vartheta) \quad\text{Neural Networks kernel}. \tag{16.99}$$

While (16.97) and (16.98) are also valid kernels for Gaussian Process estimation,
(16.99) does not satisfy Mercer's condition (see Section 4.6 for details) and thus
cannot be used in Gaussian processes.¹¹ Figure 16.7 gives sample realizations from
the corresponding process. The use of (16.99) is impossible for GP priors, unless
we diagonalize the matrix $K$ explicitly and render it positive definite by replacing
$\lambda_i$ with $|\lambda_i|$. This is a very costly procedure (see also [480, 210]) as it involves
computing the eigensystem of $K$.
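
Drawing a sample from a Laplacian process prior amounts to drawing the coefficients $\alpha_i$ independently with density proportional to $\exp(-|\alpha_i|)$ and evaluating the kernel expansion. The following sketch does this for the three kernels above; the function names, the evaluation grid, and the default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_rbf(X, Z, sigma2=0.1):
    """Gaussian RBF kernel (16.97)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def laplacian_rbf(X, Z, sigma=0.1):
    """Laplacian RBF kernel (16.98)."""
    d = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)

def tanh_kernel(X, Z, kappa=10.0, theta=1.0):
    """Neural Networks kernel (16.99); not a Mercer kernel in general."""
    return np.tanh(kappa * X @ Z.T + theta)

def sample_laplacian_process(kernel, X_data, X_eval, rng):
    """Draw alpha_i ~ exp(-|alpha_i|) / 2 and return f on X_eval."""
    alpha = rng.laplace(loc=0.0, scale=1.0, size=X_data.shape[0])
    return kernel(X_eval, X_data) @ alpha

rng = np.random.default_rng(0)
X_data = rng.uniform(0.0, 1.0, size=(400, 2))        # data locations in [0, 1]^2
grid = np.linspace(0.0, 1.0, 25)
X_eval = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)

samples = {
    "gaussian": sample_laplacian_process(gaussian_rbf, X_data, X_eval, rng),
    "laplacian": sample_laplacian_process(laplacian_rbf, X_data, X_eval, rng),
    "tanh": sample_laplacian_process(tanh_kernel, X_data, X_eval, rng),
}
```

No positive definiteness of the kernel matrix is needed anywhere in this procedure, which is why the tanh kernel can be used without modification.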


16.5.3 Prediction 


Since one of the aims of using a Laplacian prior on the coefficients $\alpha_i$ is to achieve
sparsity of the expansion, it does not appear sensible to use a Bayesian averaging
scheme (as in Section 16.1.3) to compute the mean of the posterior distribution,
since such a scheme leads to mostly nonzero coefficients. Instead we seek to obtain
the mode of the distribution (the MAP estimate), as described in Section 16.2.1.


11. The covariance matrix $K$ has to be positive definite at all times. An analogous
application of the theory of conditionally positive definite kernels would be possible as well, as
pointed out in Section 2.4. We would simply assume a Gaussian Process prior on a linear
subspace of the $y_i$.
Figure 16.7 Left Column: Grayscale plots of the realizations of several Laplacian
Processes. The black dots represent data points. Right Column: 3D plots of the same samples of
the process. We used 400 data points sampled at random from $[0, 1]^2$ using a uniform
distribution. Top to bottom: Gaussian kernel (16.97) ($\sigma^2 = 0.1$), Laplacian kernel (16.98) ($\sigma = 0.1$),
and Neural Networks kernel (16.99) ($\kappa = 10$, $\vartheta = 1$). Note that the Laplacian kernel is
significantly less smooth than the Gaussian kernel, as with a Gaussian Process with Laplacian
kernels. Moreover, observe that the Neural Networks kernel corresponds to a nonstationary
process; that is, its covariance properties are not translation invariant.

Unlike in Gaussian Process regression, this does not give an exact solution of
the problem of finding the mean, since mode and mean do not coincide for Laplacian
regularization (recall Figure 16.2). Nonetheless, the MAP estimate is computationally
attractive, since if both $-\ln p(\xi_i)$ and $\gamma(\alpha_i)$ are convex, the optimization
problem has a unique minimum.

The method closely follows our reasoning in the case of Gaussian Processes.
Recall that the posterior probability of a hypothesis (16.23) is proportional to

$$p(f|X, Y) \propto p(Y|f, X)\, p(f|X) \tag{16.100}$$

$$= \left[\prod_{i=1}^m p(y_i - f(x_i))\right] p(f|X) \tag{16.101}$$

$$= \frac{1}{Z} \left[\prod_{i=1}^m p(y_i - f(x_i))\right] \exp\left(-\sum_{i=1}^m \gamma(\alpha_i)\right). \tag{16.102}$$

To obtain (16.101), we exploit the deterministic dependency $t_i = f(x_i)$. The latter
allows us to state $p(Y|f, X)$ explicitly by integrating out the random variables
$\xi_i = y_i - t_i$.¹² To carry out inference we write the problem of finding the MAP
estimate of (16.100) as an optimization problem and obtain

$$\begin{aligned} \text{minimize} \quad & \sum_{i=1}^m \left[-\ln p(\xi_i) + \gamma(\alpha_i)\right], \\ \text{subject to} \quad & K\alpha = y + \xi, \end{aligned} \tag{16.103}$$

where $\alpha, \xi \in \mathbb{R}^m$ and $K_{ij} = k(x_i, x_j)$ as usual. For $-\ln p(\xi_i) = |\xi_i|$ and $\gamma(\alpha_i) = |\alpha_i|$, this
leads to a linear program (see Section 4.9.2), and the solution can be readily used as
a MAP estimate for Laplacian processes (a similar reasoning holds for soft margin
loss functions). Likewise for Gaussian noise, we obtain a quadratic program with a
simple objective function but a dense set of constraints, by analogy to Basis Pursuit
[104]. The derivation is straightforward; see also Problem 16.17 for details.
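
For Gaussian noise of variance $\sigma^2$, the MAP problem reduces to minimizing $\tfrac{1}{2\sigma^2}\|y - K\alpha\|^2 + \|\alpha\|_1$. The sketch below solves it with iterative soft thresholding (proximal gradient descent). This is a substituted, simple solver chosen for illustration, not the quadratic programming formulation referred to in the text; the function names and toy data are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def neg_log_posterior(alpha, K, y, sigma2):
    """Gaussian noise term plus Laplacian prior, up to constants."""
    r = y - K @ alpha
    return 0.5 * r @ r / sigma2 + np.abs(alpha).sum()

def laplacian_map_ista(K, y, sigma2, n_iter=500):
    """Approximate MAP estimate by proximal gradient steps of size 1/L."""
    L = np.linalg.norm(K, 2) ** 2 / sigma2      # Lipschitz constant of the gradient
    alpha = np.zeros(K.shape[1])
    for _ in range(n_iter):
        grad = K.T @ (K @ alpha - y) / sigma2
        alpha = soft_threshold(alpha - grad / L, 1.0 / L)
    return alpha

# Toy problem (assumed data): targets built from two kernel functions plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2))
y = 2.0 * K[:, 15] - 1.5 * K[:, 35] + 0.05 * rng.standard_normal(50)
sigma2 = 0.05

alpha_map = laplacian_map_ista(K, y, sigma2)
objective_start = neg_log_posterior(np.zeros(50), K, y, sigma2)
objective_end = neg_log_posterior(alpha_map, K, y, sigma2)
n_nonzero = int(np.count_nonzero(alpha_map))
```

Coefficients that the thresholding step sets to zero stay exactly zero in the returned vector, so `n_nonzero` gives the size of the resulting expansion.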


16.5.4 Confidence Intervals for Gaussian Noise 


One of the key advantages of Bayesian modelling is that we can obtain explicit
confidence intervals for the predictions, provided the assumptions made regarding
the priors and distribution are satisfied. Even for Gaussian noise, however,
no explicit meaningful expansion using the MAP estimate $\alpha_{\mathrm{MAP}}$ is possible, since
$\gamma(\alpha_i) = |\alpha_i|$ is non-differentiable at 0 (otherwise we could make a quadratic
approximation at $\alpha_i = 0$). Nonetheless, a slight modification permits computationally
efficient approximation of such error bounds.

The modification consists of dropping all variables $\alpha_i$ for which $\alpha_{\mathrm{MAP},i} = 0$ from
the expansion (this renders the distribution flatter and thereby overestimates the
error), and replacing all remaining variables by linear approximations (we replace
$|\alpha_i|$ by $\mathrm{sgn}(\alpha_{\mathrm{MAP},i})\, \alpha_i$).


12. For the purpose of minimizing (16.100), it is sometimes convenient to keep the $\xi_i$, which
then serve as slack variables in the convex optimization problem.

In other words, we assume that variables with zero coefficients do not influence
the expansion, and that the signs of the remaining variables do not change. This is
a sensible approximation since for large sample sizes, which Laplacian processes
are designed to address, the posterior is strongly peaked around its mode [519].
Thus the contribution of $\|\alpha\|_1$ around $\alpha_{\mathrm{MAP}}$ can be considered to be approximately
linear.

Denote by $\alpha_M$ the vector of nonzero variables, obtained by deleting all entries
where $\alpha_{\mathrm{MAP},i} = 0$, by $s$ the vector with elements $\pm 1$ such that $\|\alpha_{\mathrm{MAP}}\|_1 = s^\top \alpha_M$,
and by $K_M$ the matrix generated from $K$ by removing the columns corresponding
to $\alpha_{\mathrm{MAP},i} = 0$. Then the posterior (now written in terms of $\alpha$ for convenience) can
be approximated as

$$p(\alpha_M | Z) \propto \exp\left(-\frac{1}{2\sigma^2} \|y - K_M \alpha_M\|^2\right) \exp\left(-s^\top \alpha_M\right). \tag{16.104}$$

Collecting linear and quadratic terms, we see that

$$\alpha_M \sim N\left(\alpha_{\mathrm{MAP}},\; \sigma^2 (K_M^\top K_M)^{-1}\right), \quad\text{where } \alpha_{\mathrm{MAP}} = (K_M^\top K_M)^{-1} (K_M^\top y - \sigma^2 s). \tag{16.105}$$

The equation for $\alpha_{\mathrm{MAP}}$ follows from the conditions on the optimal solution of the
quadratic programming problem (16.103), or directly from maximizing (16.100)
(after $s$ is fixed). Hence predictions at a new point $x$ are approximately normally
distributed, with

$$y(x) \sim N\left(k_M^\top \alpha_{\mathrm{MAP}},\; \sigma^2 + \sigma^2\, k_M^\top (K_M^\top K_M)^{-1} k_M\right), \tag{16.106}$$

where $k_M := (k(x_1, x), \ldots, k(x_M, x))$ and only $x_i$ with nonzero $\alpha_{\mathrm{MAP},i}$ are considered
(thus $M < m$). The additional $\sigma^2$ stems from the fact that we have additive Gaussian
noise of variance $\sigma^2$ in addition to the Laplacian process. Equation (16.106)
is still expensive to compute, but it is much cheaper to invert $K_M^\top K_M$ than a dense
square matrix $K$ (since $\alpha_{\mathrm{MAP}}$ may be very sparse). In addition, greedy approximation
methods (as described for instance in Section 16.4.4) or column generation
techniques [39] could be used to render the computation of (16.106) numerically
more efficient.
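
A minimal sketch of the approximate error bars follows. It takes the support and the sign vector $s$ as given, and it evaluates the Gaussian obtained by completing the square in (16.104), which has covariance $\sigma^2 (K_M^\top K_M)^{-1}$ and mean $(K_M^\top K_M)^{-1}(K_M^\top y - \sigma^2 s)$. The function name `laplacian_error_bars`, the toy data, and the choice of support are assumptions for illustration.

```python
import numpy as np

def laplacian_error_bars(K_M, y, s, sigma2, k_M):
    """Mean and variance of the approximate predictive distribution.

    K_M: m x M columns of K on the support, s: signs of the nonzero
    coefficients, k_M: kernel values between the support and the test point.
    """
    G = K_M.T @ K_M
    alpha_map = np.linalg.solve(G, K_M.T @ y - sigma2 * s)
    cov = sigma2 * np.linalg.inv(G)           # posterior covariance of alpha_M
    mean = k_M @ alpha_map
    var = sigma2 + k_M @ cov @ k_M            # noise plus parameter uncertainty
    return alpha_map, mean, var

# Toy problem (assumed data): three active basis functions with known signs.
rng = np.random.default_rng(0)
x = np.arange(40.0)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 8.0)
support = [5, 20, 33]
s = np.array([1.0, -1.0, 1.0])
K_M = K[:, support]
y = K_M @ np.array([2.0, -1.5, 1.0]) + 0.05 * rng.standard_normal(40)
sigma2 = 0.1

x_new = 12.5
k_M = np.exp(-((x[support] - x_new) ** 2) / 8.0)
alpha_map, mean, var = laplacian_error_bars(K_M, y, s, sigma2, k_M)

# Stationarity of the log of (16.104) at alpha_map.
stationarity = K_M.T @ (K_M @ alpha_map - y) / sigma2 + s
```

Only an $M \times M$ system has to be solved, which is the source of the savings when the MAP estimate is sparse.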


16.5.5 Data Independent Formulation 

While (16.91) gives a very natural description of the behavior of the estimator, it is
also possible to find an equivalent, albeit much less elegant, data
independent formulation. Denote by $K$ the standard kernel matrix ($K_{ij} = k(x_i, x_j)$)
and by $[K^{-1} y]_i$ the $i$th entry of the vector $K^{-1} y$. Then we may write $p(f|X)$ as

$$p(y) = \frac{1}{Z} \exp\left(-\sum_{i=1}^m \gamma\left([K^{-1} y]_i\right)\right). \tag{16.107}$$

This can be seen as follows: if $K$ has full rank, setting $y = K\alpha$ yields $K^{-1} y =
K^{-1} K \alpha = \alpha$.

As an aside, note that some priors, such as (16.94) and (16.93), can also be
interpreted as changes in the metric given by Gaussian processes. Recall that the
latter can be stated as

$$p(y) \propto \exp\left(-\tfrac{1}{2} \|K^{-1/2} y\|_2^2\right). \tag{16.108}$$

By changing the metric tensor from $K^{-1/2}$ to $K^{-1}$, we recover (16.93). Replacing the
$\|\cdot\|_2$ norm by $\|\cdot\|_1$ yields (16.94). This formulation no longer depends explicitly on
$x$. Nonetheless, the data dependent notation is much more natural and provides
more insight into the inner workings of the estimator.


16.5.6 An Equivalent Gaussian Process 


We conclude this section with a proof that in the large sample size limit, there
exists a Gaussian Process for each kernel expansion with a prior on the coefficients
$\alpha_i$. For the purpose of the proof, we have to slightly modify the normalization
condition on $f$: That is, we assume

$$y(x) = \frac{1}{\sqrt{m}} \sum_{i=1}^m \alpha_i k(x_i, x), \tag{16.109}$$

where $\alpha_i \sim \exp(-\gamma(\alpha_i))$. In the limit $m \to \infty$, the following theorem holds.


Theorem 16.9 (Convergence to Gaussian Process) Denote by $\alpha_i$ independent random
variables (we do not require identical distributions on $\alpha_i$) with unit variance and
zero mean. Furthermore, assume that there exists a distribution $p(x)$ on $\mathcal{X}$ according to
which a sample $\{x_1, \ldots, x_m\}$ is drawn, and that $k(x, x')$ is bounded on $\mathcal{X} \times \mathcal{X}$. Then the
random variable $y(x)$ given by (16.109) converges for $m \to \infty$ to a Gaussian process with
zero mean and covariance function

$$\tilde{k}(x, x') = \int_{\mathcal{X}} k(x, \bar{x})\, k(x', \bar{x})\, p(\bar{x})\, d\bar{x}. \tag{16.110}$$

This means that instead of a Laplacian process prior, we could use any other
factorizing prior on the expansion coefficients $\alpha_i$ and in the limit still obtain an
equivalent stochastic process.


Proof To prove the first part, we need only check that $y(x)$ and any linear
combination $\sum_j \beta_j y(x'_j)$ (for arbitrary $x'_j \in \mathcal{X}$) converge to a normal distribution. By
application of a theorem of Cramér [118], this is sufficient to prove that $y(x)$ is
distributed according to a Gaussian Process.

The random variable $y(x)$ is a sum of $m$ independent random variables with
bounded variance (since $k(x, x')$ is bounded on $\mathcal{X} \times \mathcal{X}$). Therefore in the limit
$m \to \infty$, by virtue of the Central Limit Theorem (e.g., [118]), we have $y(x) \sim N(0, \sigma^2(x))$
for some $\sigma^2(x) \in \mathbb{R}$. For arbitrary $x'_j \in \mathcal{X}$, linear combinations of $y(x'_j)$ also have
Gaussian distributions since

$$\sum_{j=1}^{n} \beta_j y(x'_j) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} \alpha_i \sum_{j=1}^{n} \beta_j k(x_i, x'_j), \qquad (16.111)$$

which allows the application of the Central Limit Theorem to the sum since
the inner sum $\sum_j \beta_j k(x_i, x'_j)$ is bounded for any $x_i$. This theorem also implies
$\sum_{j=1}^{n} \beta_j y(x'_j) \sim N(0, \sigma^2)$ for $m \to \infty$ and some $\sigma^2 \in \mathbb{R}$, which proves that $y(x)$ is
distributed according to a Gaussian Process.

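To show (16.110), first note that $y(x)$ has zero mean. Thus the covariance function
for finite $m$ can be found as expectation with respect to the random variables $\alpha_i$,

$$E[y(x) y(x')] = E\left[\frac{1}{m} \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x) k(x_j, x')\right] = \frac{1}{m} \sum_{i=1}^{m} k(x_i, x) k(x_i, x'), \qquad (16.112)$$

since the $\alpha_i$ are independent and have zero mean and unit variance. This expression,
however, converges to the Riemann integral over $\mathcal{X}$ with the density $p(x)$ as $m \to \infty$. Thus

$$E[y(x) y(x')] \xrightarrow{\; m \to \infty \;} \int_{\mathcal{X}} k(\bar{x}, x) k(\bar{x}, x') p(\bar{x}) \, d\bar{x}, \qquad (16.113)$$

which completes the proof. ∎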
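
The convergence of the finite-sample covariance (16.112) to the integral (16.110) can be checked numerically. The following is a minimal sketch under assumptions that are not part of the text: a Gaussian kernel of width 0.3, $p$ uniform on $[0, 1]$, and a midpoint rule standing in for the exact integral. All function names are illustrative.

```python
import math
import random


def gauss_kernel(x, z, width=0.3):
    """Bounded kernel k(x, z); the Gaussian form and the width are illustrative choices."""
    return math.exp(-((x - z) ** 2) / (2.0 * width ** 2))


def finite_m_covariance(sample, x, x_prime, kernel):
    """Right-hand side of (16.112): (1/m) * sum_i k(x_i, x) k(x_i, x')."""
    return sum(kernel(xi, x) * kernel(xi, x_prime) for xi in sample) / len(sample)


def limit_covariance(x, x_prime, kernel, grid=20000):
    """Midpoint-rule approximation of (16.110) for p uniform on [0, 1]."""
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        x_bar = (j + 0.5) * h
        total += kernel(x_bar, x) * kernel(x_bar, x_prime)
    return h * total


rng = random.Random(0)
sample = [rng.random() for _ in range(20000)]  # x_i drawn from p = U[0, 1]
cov_m = finite_m_covariance(sample, 0.4, 0.6, gauss_kernel)
cov_limit = limit_covariance(0.4, 0.6, gauss_kernel)
print(f"finite m: {cov_m:.4f}, limit: {cov_limit:.4f}")
```

Note that the distribution of the $\alpha_i$ does not enter this computation at all: only their zero mean, unit variance, and independence are used, which is the point of the theorem.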

Recently, Tipping [539] proposed a method to obtain sparse solutions for regression
and classification while maintaining their Bayesian interpretability. The basic
idea is to make extensive use of hyperparameters in determining the priors $p(\alpha_i)$
on the individual expansion coefficients $\alpha_i$.

In particular, we assume a normal distribution over $\alpha_i$ with adjustable variance.
The latter is then determined with a hyperparameter that has its most likely value
at 0; this leads to a concentration of the distribution of $\alpha_i$ around 0. This prior is
expressed analytically as

$$p(\alpha_i | s_i) = \sqrt{\frac{s_i}{2\pi}} \exp\left(-\frac{1}{2} s_i \alpha_i^2\right), \qquad (16.114)$$

where $s_i > 0$ plays the role of a hyperparameter with corresponding hyperprior

$$p(s_i) = \frac{1}{s_i} \quad \text{(this is a flat hyperprior on a log scale: } p(\ln s_i) = \text{const.), or} \qquad (16.115)$$

$$p(s_i) = \Gamma(s_i | a, b). \qquad (16.116)$$

The Gamma distribution is given by

$$\Gamma(s_i | a, b) := \frac{s_i^{a-1} b^{a} \exp(-s_i b)}{\Gamma(a)} \quad \text{for } s_i > 0. \qquad (16.117)$$


For non-informative (flat) priors, we typically choose $a = b = 10^{-4}$ (see [539]). Note
that (16.117) is heavily peaked for $s_i \to 0$. For regression, a similar assumption is
made concerning the amount of additive Gaussian noise $\sigma^2$; thus $p(\sigma^{-2}) = 1/\sigma^{-2}$ or
$p(\sigma^{-2}) = \Gamma(\sigma^{-2} | c, d)$, where typically $c = d = 10^{-4}$. Note that the priors are imposed
on the inverse variances of $\alpha_i$ and of the noise.

To explain the method in more detail, we begin with a description of the steps re- 
quired for regression estimation. The corresponding reasoning for classification is 
relegated to Section 16.6.4, since it uses methods similar to those already described 
in previous sections. 


16.6.1 Regression with Hyperparameters 


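For the sake of simplicity, we assume additive Gaussian noise. Hence, given a
kernel expansion $t = K\alpha$, we have

$$p(y | \alpha, \sigma^2) = (2\pi\sigma^2)^{-\frac{m}{2}} \exp\left(-\frac{1}{2\sigma^2} \|y - K\alpha\|^2\right). \qquad (16.118)$$

With the definition $S := \mathrm{diag}(s_1, \ldots, s_m)$, we obtain

$$p(\alpha | s) = (2\pi)^{-\frac{m}{2}} |S|^{\frac{1}{2}} \exp\left(-\frac{1}{2} \alpha^\top S \alpha\right) \qquad (16.119)$$

from (16.114). Since $p(y | \alpha, \sigma^2)$ and $p(\alpha | s)$ are both Gaussian, we may integrate out
$\alpha$ (with proper normalization) and obtain explicit expressions for the conditional
distributions of $\alpha$ and $s$. In particular, since $p(\alpha | y, s, \sigma^2) \propto p(y | \alpha, \sigma^2) p(\alpha | s)$,
using (16.119) we get

$$p(\alpha | y, s, \sigma^2) = (2\pi)^{-\frac{m}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\alpha - \mu)^\top \Sigma^{-1} (\alpha - \mu)\right), \qquad (16.120)$$

where

$$\Sigma = (\sigma^{-2} K^\top K + S)^{-1} \quad \text{and} \quad \mu = \sigma^{-2} \Sigma K^\top y. \qquad (16.121)$$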


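Additionally, note that $p(y | s, \sigma^2)$ is a convolution of two normal distributions,
namely $p(y | \alpha, \sigma^2)$ and $p(\alpha | s)$, hence the corresponding variances add up and we
obtain

$$p(y | s, \sigma^2) = \int p(y | \alpha, \sigma^2) p(\alpha | s) \, d\alpha \qquad (16.122)$$

$$= (2\pi)^{-\frac{m}{2}} |\bar{\Sigma}|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} y^\top \bar{\Sigma}^{-1} y\right), \qquad (16.123)$$

where $\bar{\Sigma} = \sigma^2 \mathbf{1} + K S^{-1} K^\top$. Eq. (16.123) is useful since it allows us to maximize the
posterior probability of $s$ provided we know $y$ and $\sigma^2$. This leads to

$$p(s | y, \sigma^2) \propto p(y | s, \sigma^2) p(s). \qquad (16.124)$$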
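
The statement that the variances add up can be made explicit. The short derivation below assumes only the generative model implied by (16.118) and (16.119), namely $y = K\alpha + \xi$ with independent $\alpha \sim N(0, S^{-1})$ and $\xi \sim N(0, \sigma^2 \mathbf{1})$; the symbol $\xi$ for the noise is introduced here for illustration.

```latex
\begin{aligned}
E[y] &= K \, E[\alpha] + E[\xi] = 0, \\
\mathrm{Cov}[y] &= K \, \mathrm{Cov}[\alpha] \, K^\top + \mathrm{Cov}[\xi]
                 = K S^{-1} K^\top + \sigma^2 \mathbf{1} = \bar{\Sigma}.
\end{aligned}
```

Since $y$ is a linear function of jointly Gaussian variables, it is itself Gaussian with this mean and covariance, which is the content of (16.123).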


In order to carry out Bayesian inference, we would have to compute

$$p(\hat{y} | y) = \int p(\hat{y} | \alpha, s, \sigma^2) p(\alpha, s, \sigma^2 | y) \, d\alpha \, ds \, d\sigma^2. \qquad (16.125)$$

In most cases, however, this integral is intractable. Under the assumption that
$p(y | s, \sigma^2) p(s) p(\sigma^{-2})$ is peaked around its mode, we may use the MAP2 approximation
(16.24) and obtain

$$p(\hat{y} | y) \approx \int p(\hat{y} | \alpha, s_{\mathrm{MAP}}, \sigma^2_{\mathrm{MAP}}) p(\alpha | y, s_{\mathrm{MAP}}, \sigma^2_{\mathrm{MAP}}) \, d\alpha. \qquad (16.126)$$

We cover the issue of finding optimal hyperparameters in the next section. For the
moment, however, let us assume that we know the values of $s_{\mathrm{MAP}}$ and $\sigma^2_{\mathrm{MAP}}$.

Since the integral in (16.126) can be seen as the convolution of two normal
distributions, we may solve the equations explicitly to obtain

$$p(\hat{y} | y, s_{\mathrm{MAP}}, \sigma^2_{\mathrm{MAP}}) \sim N(y_*, \sigma_*^2), \qquad (16.127)$$

where, using the definition of $\Sigma$ (16.121),

$$y_* = \sigma^{-2}_{\mathrm{MAP}} k^\top \Sigma K^\top y \quad \text{and} \quad \sigma_*^2 = \sigma^2_{\mathrm{MAP}} + k^\top \Sigma k. \qquad (16.128)$$

Note the similarity of (16.128) to (16.49) for Gaussian Processes.
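
The posterior (16.121) and the predictive distribution (16.128) translate directly into a few lines of linear algebra. The sketch below assumes NumPy, a Gaussian RBF kernel, and fixed, hand-picked values for $s$ and $\sigma^2$; none of these choices come from the text, and the function names are hypothetical.

```python
import numpy as np


def rvm_posterior(K, y, s, sigma2):
    """Posterior mean and covariance of alpha, following (16.121)."""
    S = np.diag(s)
    Sigma = np.linalg.inv(K.T @ K / sigma2 + S)
    mu = Sigma @ K.T @ y / sigma2
    return mu, Sigma


def rvm_predict(k_star, mu, Sigma, sigma2):
    """Predictive mean and variance at a test point, following (16.128)."""
    mean = k_star @ mu
    var = sigma2 + k_star @ Sigma @ k_star
    return mean, var


def gauss_kernel_matrix(a, b, width=0.5):
    """Illustrative Gaussian RBF kernel between two 1-D point sets."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * width ** 2))


x_train = np.linspace(-1.0, 1.0, 8)
y_train = np.sin(3.0 * x_train)
K = gauss_kernel_matrix(x_train, x_train)
s = np.ones(len(x_train))  # hyperparameters assumed known here
sigma2 = 0.01              # noise variance assumed known here

mu, Sigma = rvm_posterior(K, y_train, s, sigma2)
k_star = gauss_kernel_matrix(np.array([0.3]), x_train)[0]  # kernel values at the test point
mean, var = rvm_predict(k_star, mu, Sigma, sigma2)
print(mean, var)
```

The predictive variance is never smaller than the noise level, because $k^\top \Sigma k \geq 0$ for the positive definite matrix $\Sigma$.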


16.6.2 Finding Optimal Hyperparameters 


According to [539], the optimal parameters $s$ and $\sigma^2$ cannot be obtained in closed
form from

$$(s_{\mathrm{MAP}}, \sigma^2_{\mathrm{MAP}}) = \mathop{\mathrm{argmin}}_{(s, \sigma^2)} \left[-\ln p(y | s, \sigma^2) - \ln p(s) - \ln p(\sigma^{-2})\right]. \qquad (16.129)$$

A possible solution, however, is to perform gradient descent on the objective
function (16.129). Taking logs of the Gamma distribution (16.117) and substituting
the explicit terms for $p(y | s, \sigma^2)$ yields the following expression for the argument of
(16.129):

$$P = -\ln p(y | s, \sigma^2) - \ln p(s) - \ln p(\sigma^{-2}) \qquad (16.130)$$

$$= \frac{1}{2} \left[\ln \left|\sigma^2 \mathbf{1} + K S^{-1} K^\top\right| + y^\top \left(\sigma^2 \mathbf{1} + K S^{-1} K^\top\right)^{-1} y\right] - \sum_{i=1}^{m} (a \ln s_i - b s_i) - c \ln \sigma^{-2} + d \sigma^{-2}. \qquad (16.131)$$

Of course, if we set $a = b = c = d = 0$ (flat prior), the corresponding terms in (16.131) vanish.
Note the similarity to logarithmic barrier methods in constrained optimization,
for which constrained minimization problems are transformed into unconstrained
problems by adding logarithms of the constraints to the initial objective function
(see Chapter 6, and in particular (6.90), for more detail). In other words, the
Gamma distribution can be viewed as a positivity constraint on the hyperparameters
$s_i$ and $\sigma^{-2}$.

Differentiating (16.131) and setting the corresponding terms to 0 leads to the
update rules (see [539] and Problem 16.20 for more details)

$$s_i = \frac{1 - s_i \Sigma_{ii}}{\mu_i^2}, \qquad (16.132)$$

where we use the definitions in (16.121). The quantity $1 - s_i \Sigma_{ii}$ is a measure of the
degree to which the corresponding parameter $\alpha_i$ is “determined” by the data [338].
Likewise we obtain

$$\sigma^2 = \frac{\|y - K\mu\|^2}{m - \sum_{i} (1 - s_i \Sigma_{ii})}. \qquad (16.133)$$


It turns out that many of the parameters $s_i$ tend to infinity during the optimization
process. This means that the corresponding distribution of $\alpha_i$ is strongly peaked
around 0, and we may drop these variables from the optimization process. This
speeds up the process as the minimization progresses.

It seems wasteful to first consider the full set of possible functions $k(x_i, x)$, and
only then weed out the functions not needed for prediction. We could instead
use a greedy method for building up predictors, similar to the greedy strategy
employed in Gaussian Processes (Section 16.4.4). This is the approach in [539],
which proposes the following algorithm. After initializing the predictor with a
single basis function (the bias, for example), we test whether each new basis
function yields an improvement. This is done by guessing a large initial value $s_i$,
and performing one update step. If (16.132) leads to an increase of $s_i$, we reject the
corresponding basis function, otherwise we retain it in the optimization process.
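
The alternation between the posterior (16.121) and the re-estimates (16.132) and (16.133) can be sketched as a short loop. This is a hedged illustration rather than the algorithm of [539]: it assumes NumPy, uses a Gaussian kernel instead of the spline kernel, caps $s_i$ at `s_max` instead of removing basis functions, and clips intermediate quantities for numerical safety. The name `rvm_fit` and all constants are hypothetical.

```python
import numpy as np


def rvm_fit(K, y, n_iter=100, s_min=1e-6, s_max=1e9, eps=1e-12):
    """Alternate the posterior (16.121) with the re-estimates (16.132), (16.133)."""
    m, n = K.shape
    s = np.ones(n)
    sigma2 = max(0.1 * float(np.var(y)), 1e-6)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(K.T @ K / sigma2 + np.diag(s))
        mu = Sigma @ K.T @ y / sigma2
        # 1 - s_i * Sigma_ii: how strongly alpha_i is determined by the data
        gamma = np.clip(1.0 - s * np.diag(Sigma), 0.0, 1.0)
        s = np.clip(gamma / (mu ** 2 + eps), s_min, s_max)  # (16.132)
        residual = float(np.sum((y - K @ mu) ** 2))
        sigma2 = float(np.clip(residual / (m - gamma.sum() + eps), 1e-8, 1e8))  # (16.133)
    # Recompute the posterior mean for the final hyperparameters.
    Sigma = np.linalg.inv(K.T @ K / sigma2 + np.diag(s))
    mu = Sigma @ K.T @ y / sigma2
    return mu, s, sigma2


rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 40)
y = np.sinc(x / np.pi) + rng.uniform(-0.2, 0.2, size=x.size)  # noisy sin(x)/x target
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0)           # illustrative Gaussian kernel

mu, s, sigma2 = rvm_fit(K, y)
relevant = np.flatnonzero(s < 1e9)  # coefficients whose s_i did not hit the cap
print(len(relevant), "of", len(x), "basis functions retained; sigma2 =", sigma2)
```

Coefficients whose $s_i$ reaches the cap correspond to the variables that the text suggests dropping from the optimization; a practical implementation would remove the matching columns of $K$ to shrink the matrix inversion.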


16.6.3 Explicit Priors by Integration 


A second way to perform inference while circumventing the MAP2 estimate is
to integrate out the hyperparameters $s_i$ and then deal with $p(\alpha_i)$ in a standard
fashion. In the present case, integration can be carried out in closed form over the
hyperprior. We obtain

$$p(\alpha_i) = \int p(\alpha_i | s_i) p(s_i | a, b) \, ds_i \propto \left(b + \frac{\alpha_i^2}{2}\right)^{-\left(a + \frac{1}{2}\right)}, \qquad (16.134)$$

which is a Student-t distribution over $\alpha_i$. In other words, the effective prior on $\alpha_i$
is given by (16.91) with

$$\gamma(\alpha_i) = \left(a + \frac{1}{2}\right) \ln\left(b + \frac{\alpha_i^2}{2}\right), \qquad (16.135)$$

or after reparametrization $\gamma(\alpha_i) = a' \ln(1 + b' \alpha_i^2)$ for suitably chosen $a', b' > 0$.
This connects the Relevance Vector Machine to other methods that encode priors
directly in coefficient space without the aid of a hyperparameter. We can see that
(16.135) is heavily peaked at $\alpha_i = 0$, which explains why most of the parameters
are 0.
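
The closed form in (16.134) can be verified numerically by integrating the Gaussian (16.114) against the Gamma hyperprior (16.117). The sketch below uses only the Python standard library; the values $a = 2$, $b = 1.5$ and the integration limits are illustrative assumptions chosen so that a simple midpoint rule is accurate, and the normalization constant is worked out here rather than taken from the text.

```python
import math


def gaussian_given_s(alpha, s):
    """p(alpha | s) from (16.114); s is the inverse variance."""
    return math.sqrt(s / (2.0 * math.pi)) * math.exp(-0.5 * s * alpha ** 2)


def gamma_density(s, a, b):
    """Gamma hyperprior (16.117)."""
    return s ** (a - 1.0) * b ** a * math.exp(-s * b) / math.gamma(a)


def marginal_numeric(alpha, a, b, upper=40.0, steps=50000):
    """Midpoint-rule approximation of the integral in (16.134)."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        s = (j + 0.5) * h
        total += gaussian_given_s(alpha, s) * gamma_density(s, a, b)
    return h * total


def marginal_closed_form(alpha, a, b):
    """Normalized Student-t form, proportional to (b + alpha^2 / 2)^-(a + 1/2)."""
    const = b ** a * math.gamma(a + 0.5) / (math.sqrt(2.0 * math.pi) * math.gamma(a))
    return const * (b + 0.5 * alpha ** 2) ** (-(a + 0.5))


a, b = 2.0, 1.5  # illustrative values, not the near-flat 1e-4 setting
for alpha in (0.0, 0.5, 2.0):
    print(alpha, marginal_numeric(alpha, a, b), marginal_closed_form(alpha, a, b))
```

For the near-flat setting $a = b = 10^{-4}$ the integrand is singular at $s = 0$, so the midpoint rule used here would need to be replaced by a more careful quadrature.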

Unfortunately (16.135) proves to be unsuited to implementation, since the posterior
probability exhibits “horribly” [539] many local minima. Hence, estimates derived
from the optimization of the log posterior are not particularly meaningful.
Nonetheless, this alternative representation demonstrates a connection between 
Gaussian Processes and Relevance Vector Machines, based on Theorem 16.9: In 
the large sample size limit, Relevance Vector Machines converge to a Gaussian 
Process with kernel given by (16.110). 


16.6.4 Classification 


For classification, we follow a scheme similar to that in Section 16.3.5. In order to
keep matters simple, we only consider the binary classification case. Specifically,
we carry out logistic regression by using (16.3) as a model for the distribution of
labels $y_i \in \{\pm 1\}$. As in regression, we use a kernel expansion, this time for the
latent variables $t = K\alpha$. The negative log posterior is given by

$$-\ln p(\alpha | y, s) = \sum_{i=1}^{m} -\ln p(y_i | t(x_i)) - \sum_{i=1}^{m} \ln p(\alpha_i | s_i) + \text{const.} \qquad (16.136)$$

Unlike in regression, however, we cannot minimize (16.136) explicitly and have
to resort to approximate methods, such as the Laplace approximation (see Section
16.4.1). Computing the first and second derivatives of (16.136) and using the
definitions (16.59) and (16.60) yields

$$\partial_\alpha \left[-\ln p(\alpha | y, s)\right] = K c + S \alpha, \qquad (16.137)$$

$$\partial_\alpha^2 \left[-\ln p(\alpha | y, s)\right] = K^\top C K + S. \qquad (16.138)$$

This allows us to obtain a MAP estimate of $p(\alpha | y, s)$ by iterative application of
(16.58), and we obtain an update rule for $\alpha$ in a manner analogous to (16.61):

$$\alpha_{\mathrm{new}} = \alpha_{\mathrm{old}} - (K^\top C K + S)^{-1} (K c + S \alpha_{\mathrm{old}}) = (K^\top C K + S)^{-1} K (C K \alpha_{\mathrm{old}} - c). \qquad (16.139)$$

If the iteration scheme converges, it will converge to the minimum of the negative
log posterior. We next have to provide an iterative method for updating the
hyperparameters $s$ (note that we do not need $\sigma^2$). Since we cannot integrate out $\alpha$
explicitly (we had to resort to an iterative method even to obtain the mode of the
distribution), it is best to use the Gaussian approximation obtained from (16.138).
This gives an approximation of the value of the posterior distribution $p(s | y)$ and
allows us to apply the update rules developed for regression in classification.
Setting $\mu = \alpha_{\mathrm{MAP}}$ and $\Sigma = (K^\top C K + S)^{-1}$, we can use (16.132) to optimize $s_i$. See [539]
for further detail and motivation.
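
The Newton iteration (16.139) can be sketched for a logistic likelihood as follows. Since (16.59) and (16.60) are not reproduced in this section, the code assumes that $c$ collects the first derivatives and $C$ the diagonal second derivatives of $-\ln p(y_i | t_i)$ with respect to $t_i$, with $p(y_i | t_i) = 1 / (1 + \exp(-y_i t_i))$. NumPy, the Gaussian kernel, the fixed $s$, and all names are illustrative assumptions.

```python
import numpy as np


def neg_log_posterior(alpha, K, y, s):
    """Objective (16.136) up to a constant, for the logistic likelihood."""
    t = K @ alpha
    return float(np.sum(np.logaddexp(0.0, -y * t)) + 0.5 * np.sum(s * alpha ** 2))


def gradient(alpha, K, y, s):
    """Gradient (16.137): K c + S alpha."""
    p = 1.0 / (1.0 + np.exp(-(K @ alpha)))
    c = p - (y + 1.0) / 2.0
    return K.T @ c + s * alpha


def newton_step(alpha, K, y, s):
    """One update of the form (16.139)."""
    t = K @ alpha
    p = 1.0 / (1.0 + np.exp(-t))  # logistic function of the latent variables
    c = p - (y + 1.0) / 2.0       # first derivative of -ln p(y_i | t_i) w.r.t. t_i
    C = np.diag(p * (1.0 - p))    # second derivatives on the diagonal
    H = K.T @ C @ K + np.diag(s)  # Hessian (16.138)
    return np.linalg.solve(H, K.T @ (C @ t - c))


x = np.linspace(-2.0, 2.0, 20)
y = np.where(x > 0, 1.0, -1.0)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0)  # illustrative Gaussian kernel
s = np.ones(len(x))                                  # hyperparameters held fixed here

alpha = np.zeros(len(x))
for _ in range(25):
    alpha = newton_step(alpha, K, y, s)
print(neg_log_posterior(alpha, K, y, s), np.linalg.norm(gradient(alpha, K, y, s)))
```

In a full implementation this inner loop would alternate with the hyperparameter update (16.132), using $\mu = \alpha_{\mathrm{MAP}}$ and the inverse of the final Hessian as $\Sigma$.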


16.6.5 Toy Example and Discussion 


We conclude this brief description of RVMs with a toy example (Figure 16.8),
taken from [539], in which regression is performed on a noisy sinc function (cf.
Chapter 9) using RV and SV estimators. In both cases, a linear spline kernel [572],

$$k(x, x') = x x' + x x' \min\{x, x'\} - \frac{x + x'}{2} \left(\min\{x, x'\}\right)^2 + \frac{\left(\min\{x, x'\}\right)^3}{3}, \qquad (16.140)$$

was used. The noise, added to the y-values, was uniformly distributed over
$[-0.2, 0.2]$.

Figure 16.8 Support (left) and Relevance (right) Vector approximations to sinc(x), based
on 100 noisy samples. The estimated functions are drawn as solid lines, the true function in
gray, and Support/Relevance Vectors are circled.

In the SV case, all points with a distance $> \varepsilon$ (not shown) become SVs; this
leads to an expansion that is not particularly sparse. The RVM, on the other hand,
constructs a solution which is not constrained to use these points, and delivers a
much sparser solution.

It should be noted that similarly sparse solutions can be obtained for SVMs by
using “reduced set” post-processing (Chapter 18). Although this step adds to the
training time, SVMs train far more quickly than RVMs on large data sets (of the
order of thousands of examples), and the added post-processing time is then often
negligible. Nevertheless, it is fair to say that the RVM is an elegant and principled
Bayesian alternative to SVMs.


In this chapter, we presented an overview of some of the more common techniques
of Bayesian estimation, namely Gaussian Processes and the Relevance Vector
Machine, and a novel method: Laplacian Processes. Due to the wealth of existing
concepts and algorithms developed in Bayesian statistics, it is impossible to
give a comprehensive treatment in a single chapter. Such a treatise would easily
fill a whole book in its own right.


16.7.1 Topics We Left Out


We did not discuss Markov-Chain Monte Carlo methods and their application 
to Bayesian Estimation [382, 426] as an alternate way of performing Bayesian 
inference. These work by sampling from the posterior distribution rather than 
computing an approximation of the mode. 

On the model side, the maximum entropy discrimination paradigm [499, 114, 
261] is a worthy concept in its own right, powerful enough to spawn a whole fam- 
ily of new inference algorithms both with [261] and without [263] kernels. The 
main idea underlying this concept is to seek the least informative estimate for pre- 
diction purposes. In addition, rather than requiring that a specific function satisfy 
certain constraints, we require only that the distribution satisfy the constraints on 
average. 

Methods such as the Bayes-Point machine [239] and Kernel Billiard [450, 451] 
can also be used for estimation purposes. The idea behind these methods is to 
“play billiard” in version space (see Problem 7.21) and average over the existing 
trajectories. The version space is the set of all $w$ for which the empirical risk
$R_{\mathrm{emp}}[w]$ vanishes or is bounded by some previously chosen constant. Proponents
of this strategy claim rapid convergence due to the good mixing properties of the 
dynamical system. 

Finally, we left the field of graphical models (see for instance [525, 260, 274, 273] 
and the references therein) completely untouched. These algorithms model the de- 
pendency structure between different random variables in a rather explicit fashion 
and use efficient approximate inference techniques to solve the optimization prob- 
lems. To date it is not clear how such methods can be combined with kernels. 


16.7.2 Key Issues 


Topics covered in the chapter include deterministic and approximate methods 
for Bayesian inference, with an emphasis on the Maximum a Posteriori (MAP) 
estimate and the treatment of hyperparameters. As a side-effect, we observe that 
the minimization of regularized risk is closely related to approximate Bayesian 
estimation. 

One of the first consequences of this link is the connection between Gaussian 
Processes and Support Vector Machines. While the former are defined in terms 
of correlations between random variables, the latter are derived from smoothness 
assumptions regarding the estimate and feature space considerations. This con- 
nection also allows us to exchange uniform convergence statements and Bayesian 
error bounds between both types of reasoning. 

As a side effect, this connection also gives rise to a new class of prior, namely
those corresponding to $\ell_1$ regularization and linear programming machines. Since
the coefficients $\alpha_i$ then follow a Laplacian distribution, we name the corresponding
stochastic process a Laplacian Process. This new point of view allows the derivation
of error bars for the estimates in a way that is not easily possible in a statistical
learning theory framework. It turns out that this leads to a data dependent prior
on the function space.

Finally, the Relevance Vector Machine introduces individual hyperparameters
for the distributions of the coefficients $\alpha_i$. This makes certain optimization
problems tractable (matrix inversion) that otherwise would have remained infeasible
(MAP estimate with the Student-t distribution as a prior). We expect that the
technique of representing complex distributions by a normal distribution cum
hyperprior is also a promising approach for other estimation problems.

Taking a more abstract view, we expect a convergence between different estima- 
tion algorithms and inference principles derived from risk minimization, Bayesian 
estimation, and Minimum Description Length concepts. Laplacian Processes and 
the Relevance Vector Machine are two examples of such convergence. We hope 
that more such methods will follow in the next few years. 


16.8 Problems 


16.1 (Prior Distributions •) Compute the log prior probability according to (16.10) for
the following functions: $\sin x$, $\sin x + 0.1 \sin 10x$, $\sin x + 0.01 \sin 100x$, and more
generally, $f_n(x) := \sin x + \frac{1}{n} \sin nx$ on $[-\pi, \pi]$. Show that the sequence $f_n$ converges to $\sin x$, yet
that $f_n'(x)$ does not converge to $\cos x$. Interpret this result in terms of prior distributions
(Hint: What can you say about functions where the prior probability also converges?).


16.2 (Hypothesis Testing and Tail Bounds ••) Assume we want to test whether a
coin produces equal numbers of heads and tails. Compute the likelihood that among $m$
trials we observe $m_h$ heads and $m_t = m - m_h$ tails, given a probability $\pi_h$ that a head is
observed (Hint: Use the binomial distribution and, if necessary, its approximation with a
normal distribution).

Next, compute the posterior probability for the following two prior assumptions on the
possible values of $\pi_h$:

$$p(\pi_h) = 1 \quad \text{(flat prior)}, \qquad (16.141)$$

$$p(\pi_h) = 12 \pi_h^2 (1 - \pi_h). \qquad (16.142)$$

Give an interpretation of (16.141) and (16.142). What is the minimum number of coins
we need to toss (assuming that we get an equal number of heads and tails) in order to state
that the probability of heads equals that of tails within precision $\varepsilon$ with $1 - \eta$ probability?

How many tosses do you need on average to detect a faulty coin that generates heads
with probability $\pi_h$?


16.3 (Label Noise •) Assume that we have a random variable $y$ with $P(y = 1) = p$, and
consequently $P(y = -1) = 1 - p$. What is the probability of observing $y = 1$ if we flip
each label with probability $\eta$? What is $P(y = 1)$ if we randomly assign a label for all $y$
with probability $\eta$?


16.4 (Projected Normal Distributions ••) Assume that we have two clouds of points,
one belonging to class 1 and a second belonging to class —1, and both normally distributed 
with unit variance. 

Show that for any projection onto a line, the distributions of the points on the line is 
a mixture of two normal distributions. What happens to the means and the variances? 
Formulate the likelihood that an arbitrary point on the line belongs to class 1 or —1. 


16.5 (Entropic Priors •) Assume we have an experiment with $n$ possible outcomes where
we would like to estimate the probabilities $\pi_1, \ldots, \pi_n$ with which these outcomes occur. We
use the following prior distribution $p(\pi_1, \ldots, \pi_n)$,

$$-\ln p(\pi_1, \ldots, \pi_n) = -H(\pi_1, \ldots, \pi_n) + c = \sum_{i=1}^{n} \pi_i \ln \pi_i + c, \qquad (16.143)$$

where $c$ is a normalization constant and $H$ denotes the entropy of the probabilities
$\pi_1, \ldots, \pi_n$.

Show that (16.143) describes a proper prior distribution. Compute the likelihood of
observing the outcomes $1, \ldots, n$ at times $m_1, \ldots, m_n$ (see Problem 16.2). Derive the value
of the log posterior distribution (use Gaussian approximation). Does the normalization
constant $c$ matter? What happens if we rescale (16.143) by a constant $s$? How does the
log posterior change? Give examples of how $s$ can be adjusted automatically (automatic
relevance determination) and formulate the MAP2 estimate (•••).


16.6 (Inference and Variance •) For a normal distribution in two variables with

$$K = \begin{pmatrix} 1 & 0.75 \\ 0.75 & 0.75 \end{pmatrix} \qquad (16.144)$$

as covariance and zero mean, compute the variance in terms of the first variable if the
second one is observed, and vice versa.


16.7 (Samples from a Gaussian Process Prior •) Draw a sample $X$ at random from
the uniform distribution on $[0, 1]^2$ and compute the corresponding covariance matrix
$K$. Use for instance the linear kernel $k(x, x') = \langle x, x' \rangle$ and the Gaussian RBF kernel
$k(x, x') = \exp\left(-\frac{1}{2} \|x - x'\|^2\right)$.

Write a program which draws samples from the normal distribution $N(0, K)$
(Hint: Compute the eigenvectors of $K$ first). What difference do you observe when using
different kernels?


16.8 (Time Series and Autocorrelation •) Assume a time series of normally distributed
random variables $\xi_t$ drawn from a stationary distribution. Why is the autocorrelation
function independent of time? Show that the random variables $\xi_t$ follow a Gaussian Process.
What is the covariance kernel?


16.9 (Gaussian Processes with Roundoff Noise •) Give an expression for the posterior
probability of a Gaussian Process with covariance function $k(x, x')$ in the presence of
roundoff noise (see (16.45)).


16.10 (Hyperparameter Updates •••) Assume that $k(x, x')$ also depends on a parameter
$w$. Compute the derivative of the log posterior for GP regression with respect to $w$ (see
[197, 198] for details).

Can you adapt the sparse greedy approximation scheme (Section 16.4.4) to maximize
the log posterior with respect to the hyperparameters (◦◦◦)?


16.11 (Convergence of Laplace Approximation ◦◦◦) Find a lower bound on the radius
of convergence of the Laplace Approximation (Section 16.4.1). Hint: show that the
iteration step of the Laplace approximation is a contraction.


16.12 (Laplace Approximation in Function Space ◦◦◦) Instead of formulating the
Newton approximation steps in coefficient space, as done in (16.61), we may also derive
the update rule in function space, which promises better convergence properties and
a numerically less expensive implementation (see Section 10.6.1).

Hint: Compute the gradient and the Hessian (Note: This is an operator in the present
case) for the log posterior $p(f | X, Y)$ of a Gaussian Process. Show that we can still invert
the Hessian efficiently since it is simply a projection operator on the subspace spanned by
$k(x_i, \cdot)$. State the update rule.


16.13 (Upper and Lower Bounds on Convex Functionals •••) Prove (16.65) and
(16.66). Hint: For the lower bound, exploit convexity of the logistic function and construct
a tangent. For the upper bound, construct a quadratic function with curvature
larger than the logistic.

Show that (16.65) and (16.66) are tight. What can you say in more general quadratic 
cases? See [264] for a detailed derivation. 


16.14 (Eigenfunctions of k ••)

▪ How do the eigenfunctions of Figure 16.4 change if $w$ changes? What happens to the
eigenvalues? Confirm this behavior using numerical simulations.

▪ What do you expect if the dimensionality of $x$ increases? What if it increases
significantly? What happens if we replace the Gaussian kernel by a Laplacian kernel?

▪ Design an approximate training algorithm for Gaussian Processes using the fact that
you can approximately represent $K$ by a lower rank system (•••).

▪ How do these findings relate to Kernel Principal Component Analysis (Chapter 14)?


16.15 (Low-rank approximations of K ••) Denote by $\mathcal{H}$ an RKHS with kernel $k$.
Given a set of basis functions $\tilde{S} := \{k(\tilde{x}_1, \cdot), \ldots, k(\tilde{x}_n, \cdot)\}$, compute the optimal
approximation of $k(x, \cdot)$ in terms of $\tilde{S}$ with respect to the norm induced by $\mathcal{H}$.

Now assume that we want to approximate functions from a larger set, say
$S := \{k(x_1, \cdot), \ldots, k(x_m, \cdot)\}$. Show that this leads to the approximation of the matrix $K$ with
$K_{ij} := k(x_i, x_j)$ by a low rank matrix $\tilde{K}$.


16.16 (Sparse Greedy [503] and Random Approximation [602] ••) Experimentally
compare the settings where basis functions are selected purely at random (Nyström
method) with the case where they are selected in a sparse greedy fashion.


16.17 (Optimization Problems for Laplacian Process ••) Derive the optimization
problem for regression with a Laplacian process prior under the assumption of additive
Gaussian noise. Hint: What is the posterior probability of an estimator $f$ under the current
assumptions? Compute the dual optimization problem. Is the latter easier to deal
with?

Now assume additive Laplacian noise. How does the optimization problem change? 


16.18 (Confidence Intervals for Laplacian Noise •••) Derive confidence intervals
for Laplacian Noise and a Gaussian Process or Laplacian regularizer. Is there a closed-
form expansion? Can you find an efficient sampling scheme (◦◦◦)?


16.19 (Efficient Computation of Confidence Terms ee) Using sparse greedy approx- 
imation, devise an algorithm for computing (16.106), and in particular k (K } Km) km, 
more efficiently. Hint: Use a variant of Theorem 16.5 and a rank-one update method. 


16.20 (Hyperparameter Updates for Relevance Vector Machines •••) Derive the
update rules for the hyperparameters (16.132) and (16.133). Hint: See [338] for details.


16.21 (Parameter Coding for Relevance Vector Machines ◦◦◦) Denote by $p(\alpha_i)$ a
prior on the coefficients $\alpha_i$ of a kernel expansion. Can you find a deconvolution function
$p(s_i)$ such that

$$p(\alpha_i) = \int p(\alpha_i | s_i) p(s_i) \, ds_i? \qquad (16.145)$$

Hint: Use the Fourier transformation. What does this mean when encoding sparse priors
such as $p(\alpha_i) = \frac{1}{2} e^{-|\alpha_i|}$? Can you construct alternative training algorithms for Laplacian
Processes?


16.22 (RVM and Generative Topographic Mapping ◦◦◦) Apply the RVM method as
a prior to the Generative Topographic Mapping described in Section 17.4.2; in other words,
instead of using $\sum_i \alpha_i^2$ as the negative log prior probability on the weights of the individual
nodes.

Can you find an incremental approach? (Hint: Use the method described in [539].)


Regularized Principal Manifolds


In Chapter 14, which covered Kernel Principal Component Analysis and Kernel 
Feature Analysis, we viewed the problem of unsupervised learning as a problem 
of finding good feature extractors. This is not the only possible way of extracting 
information from data, however. 

For instance, we could determine the properties that best describe the data; in 
other words, that represent the data in an optimal compact fashion. This is useful 
for the purpose of data visualization, and to test whether new data is generated 
using the same distribution as the training set. Inevitably, this leads to a (possibly 
quite crude) model of the underlying probability distribution. Generative models 
such as Principal Curves [231], the Generative Topographic Mapping [52], several 
linear Gaussian models [444], and vector quantizers [21] are examples thereof. 

The present chapter covers data descriptive models. We first introduce the 
quantization functional [509] (see Section 17.1), which plays the role of the risk 
R[f] commonly used in supervised learning (see Chapter 3). This allows us to use 
techniques from regularization theory in unsupervised learning. In particular, it 
leads to a natural generalization (to higher dimensionality and different criteria of 
regularity) of the principal curves algorithm with a length constraint [292], which 
is presented in Section 17.2, together with an efficient algorithm (Section 17.3). 

In addition, we show that regularized quantization functionals can be seen in 
the context of robust coding; that is, optimal coding in the presence of a noisy chan- 
nel. The regularized quantization error approach also lends itself to a comparison 
with Bayesian techniques based on generative models. Connections to other algo- 
rithms are pointed out in Section 17.4. The regularization framework allows us to 
present a modified version of the Generative Topographic Mapping (GTM) [52] 
(Section 17.4.2), using recent developments in Gaussian Processes (Section 16.3). 

Finally, the quantization functional approach also provides a versatile tool to 
find uniform convergence bounds. In Section 17.5, we derive bounds on the quan- 
tization error and on the rate of convergence that subsume several existing results 
as special cases. This is possible due to the use of functional analytic tools. 

Readers mainly interested in the core algorithm are best served by reading the 
first three sections and possibly the experimental part (Section 17.6). Chapters 3 
and 4 are useful to understand the formulation of regularized quantization func- 
tionals. Section 17.4 is mainly relevant to readers interested in Bayesian alterna- 
tives, such as the Generative Topographic Mapping. Clearly, knowledge of the ba- 
sic Bayesian concepts presented in Section 16.1 will be useful. Finally, Section 17.5 
requires an understanding of uniform convergence bounds and operator theoretic
methods in learning theory (Section 12.4). 

The basic idea of the quantization error approach is that we should be able to
learn something about data by learning how to efficiently compress or encode it as a 
simpler yet still meaningful object. The quality of the encoding is assessed by the 
reconstruction error (the quantization error) it causes (in other words, how close 
the reconstructions come to the initial data), and the simplicity of the device that 
generates the code. The latter is important, since the coding device then contains 
the information we seek to extract. 

Unlike most engineering applications,¹ we also allow for continuous codes.
Practical encoding schemes, on the other hand, concern themselves with the number
of bits needed to code an object. This reflects our emphasis on information extraction
by learning the coding device itself. As we will see in Section 17.4.3, however,
constraints on the simplicity of the coding device are crucial to avoid overfitting
for real valued continuous codes.


17.1.1 Quantization Error 


Let us begin with the usual definitions: denote by $\mathcal{X}$ a (possibly compact subset
of a) vector space, and by $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ a dataset drawn iid from an
unknown underlying probability distribution $P(x)$. The observations are members
of $\mathcal{X}$. Additionally, we define the index sets $\mathcal{Z}$, maps $f : \mathcal{Z} \to \mathcal{X}$, and classes $\mathcal{F}$ of such
maps (with $f \in \mathcal{F}$). Here $\mathcal{Z}$ is the domain of our code, and the map $f$ is intended
to describe certain basic properties of $P(x)$. In particular, we seek $f$ such that the
so-called quantization error,

$$R[f] := \int_{\mathcal{X}} \min_{z \in \mathcal{Z}} c(x, f(z)) \, dP(x), \qquad (17.1)$$

is minimized. In this setting, $c(x, f(z))$ is the loss function determining the error of
reconstruction. We very often set $c(x, f(z)) = \|x - f(z)\|^2$, where $\|\cdot\|$ denotes the
Euclidean norm. Unfortunately, the problem of minimizing $R[f]$ is insolvable,
as $P$ is generally unknown. Hence we replace $P$ using the empirical density

$$P_m(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x_i), \qquad (17.2)$$

and instead of (17.1) we analyze the empirical quantization error

$$R_{\mathrm{emp}}[f] := \int_{\mathcal{X}} \min_{z \in \mathcal{Z}} c(x, f(z)) \, dP_m(x) = \frac{1}{m} \sum_{i=1}^{m} \min_{z \in \mathcal{Z}} c(x_i, f(z)). \qquad (17.3)$$

The general problem of minimizing (17.3) is ill posed [538, 370]. Even worse, with
no further restrictions on $\mathcal{F}$, small values of $R_{\mathrm{emp}}[f]$ fail to guarantee small values
of $R[f]$.

Many problems in unsupervised learning can be cast in the form of finding a
minimizer of (17.1) or (17.3). Let us consider some practical examples.

1. Consider the task of displaying an image with 24 bit color depth on an 8 bit display
with color lookup table (CLUT), meaning that the 256 possible colors may be chosen
from a 24 bit color-space. Simply keeping the most significant bits of each color is not a
promising strategy: images of a forest benefit from an allocation of many colors in the green
color-space, whereas images of the sky typically benefit from a dominance of white and
blue colors in the CLUT. Consequently, the colors chosen in the CLUT provide us with
information about the image.

Figure 17.1 Sample Mean in $\mathbb{R}^2$. The observations are mapped to one codebook vector,
denoted by the triplet (1, x, y). Decoding is done by mapping the codebook vector back to
(x, y). Obviously, the coding error is given by the average deviations between the sample
mean and the data.


17.1.2 Examples with Finite Codes 


We begin with cases where Z is a finite set; we can then encode f by a table of all 
its possible values. 
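
For a finite code, the empirical quantization error (17.3) can be evaluated directly from a table of code vectors. The following Python sketch assumes squared loss and NumPy; the function name `empirical_quantization_error` and the toy data are illustrative choices of ours, not part of the original text.

```python
import numpy as np

def empirical_quantization_error(X, codebook):
    """Empirical quantization error (17.3) for squared loss and a finite code.

    X        : (m, n) array of observations x_i
    codebook : (k, n) array; row z holds the code vector f(z)
    """
    # Squared Euclidean distance between every sample and every code vector.
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Minimum over the code z for each sample, then the average over the sample.
    return d2.min(axis=1).mean()

# Toy data: four points, two code vectors, each point at distance 1 from its code.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
codebook = np.array([[0.0, 1.0], [4.0, 1.0]])
err = empirical_quantization_error(X, codebook)
```

The table `codebook` plays the role of f; the inner minimum is the encoding step and the outer mean replaces the integral over P by the sum over the sample.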


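Example 17.1 (Sample Mean) Define Z := {1}, F to be the set of all constant functions,
and f(1) ∈ X. In addition, set c(x, f(z)) = ‖x − f(z)‖². Then the minimum of

\[
R[f] := \int_{\mathcal{X}} \|x - f(1)\|^2 \, dP(x)
\quad \text{and} \quad
R_{\mathrm{emp}}[f] := \frac{1}{m} \sum_{i=1}^{m} \|x_i - f(1)\|^2
\tag{17.4}
\]

yields the variance of the data. The minimizers of the quantization functionals can in this
case be determined analytically,

\[
\operatorname*{argmin}_{f \in \mathcal{F}} R[f] = \int_{\mathcal{X}} x \, dP(x)
\quad \text{and} \quad
\operatorname*{argmin}_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} x_i.
\tag{17.5}
\]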


This is the (empirical) sample mean (see also Figure 17.1). 
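
The claim of Example 17.1 is easy to check numerically. The sketch below is a minimal illustration under our own assumptions (synthetic Gaussian data, NumPy); the helper name `r_emp_constant` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic sample, m = 200

def r_emp_constant(X, f1):
    # Empirical functional of (17.4): mean squared distance to the single code vector f(1).
    return ((X - f1) ** 2).sum(axis=1).mean()

mean = X.mean(axis=0)                  # minimizer according to (17.5)
at_mean = r_emp_constant(X, mean)
total_variance = X.var(axis=0).sum()   # (1/m)-normalized variance, summed over coordinates
perturbed = r_emp_constant(X, mean + 0.1)
```

Moving the code vector away from the sample mean by a vector δ increases the error by exactly ‖δ‖², which is why `perturbed` exceeds `at_mean`.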


Example 17.2 (k-means Vector Quantization) Define Z := [k] and f : i ↦ f_i with f_i ∈
X, and denote by F the set of all such functions. If we again use c(x, f(z)) = ‖x − f(z)‖²,
then

\[
R[f] := \int_{\mathcal{X}} \min_{z \in [k]} \|x - f_z\|^2 \, dP(x)
\tag{17.6}
\]

denotes the canonical distortion error of a vector quantizer. In practice, we can use the
k-means algorithm [332] to find a set of vectors {f_1, ..., f_k} minimizing Remp[f] (see
Figure 17.2). Furthermore there exist proofs of the convergence properties of the minimizer
of Remp[f] to one of the minimizer(s) of R[f] (see [21]).

Figure 17.2 k-means clustering in R² (in this figure, k = 3). The observations are mapped to
one of the codebook vectors, denoted by the triplet (i, x_i, y_i). Decoding is done by mapping
the codebook vectors back to (x_i, y_i).


Note that in this case, minimization of the empirical quantization error leads to 


local minima, a problem quite common in this type of setting. A different choice 
of loss functions c leads to a clustering algorithm proposed in [73]. 
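
To make the k-means case concrete, here is a minimal Lloyd-style sketch in Python. It is not the implementation of [332]; the initialization from random samples, the stopping rule, and all names are assumptions made for illustration. Depending on the starting codebook it may stop in a local minimum, as noted above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style k-means sketch: alternate encoding and codebook updates."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the codebook with k distinct training samples.
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Encoding: z_i = argmin_z ||x_i - f_z||^2
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Codebook update: each f_z becomes the mean of the samples mapped to it.
        new_codebook = codebook.copy()
        for j in range(k):
            members = X[z == j]
            if len(members) > 0:       # keep the old vector for empty cells
                new_codebook[j] = members.mean(axis=0)
        converged = np.allclose(new_codebook, codebook)
        codebook = new_codebook
        if converged:
            break
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, d2.argmin(axis=1), d2.min(axis=1).mean()

# Two synthetic clusters in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
codebook, labels, err = kmeans(X, k=2)
```

The returned `err` is the empirical counterpart of (17.6). It can never exceed the error of the single-vector code of Example 17.1, since every code vector is the mean of the samples assigned to it.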


Example 17.3 (k-median and Robust Vector Quantization) Beginning with the definitions
of the previous example, and choosing c(x, f(z)) := ‖x − f(z)‖₁, we obtain the
k-median problem. Recall that ‖·‖₁ is the city-block metric. In this case,

\[
R[f] := \int_{\mathcal{X}} \min_{z \in [k]} \|x - f_z\|_1 \, dP(x).
\tag{17.7}
\]

This setting is robust against outliers, since the maximum influence of each pattern is
bounded. An intermediate setting can be derived from Huber’s robust loss function [251]
(see also Table 3.1). Here, we define

\[
c(x, f(z)) :=
\begin{cases}
\dfrac{1}{2\sigma} \|x - f(z)\|^2 & \text{for } \|x - f(z)\| \le \sigma, \\[1ex]
\|x - f(z)\| - \dfrac{\sigma}{2} & \text{otherwise,}
\end{cases}
\tag{17.8}
\]

for suitably chosen σ. Eq. (17.8) behaves like a k-means vector quantizer for small x_i, but
with the built-in safeguard of a limit on the influence of each individual pattern.
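
As a small illustration, Huber’s loss can be written in a few lines of Python. This is a sketch assuming the Euclidean norm inside the loss and the form of Table 3.1 (quadratic branch scaled by 1/(2σ)); the function name `huber_loss` is ours.

```python
import numpy as np

def huber_loss(x, fz, sigma):
    """Huber's robust loss between an observation x and its reconstruction f(z)."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(fz, dtype=float))
    if r <= sigma:
        return r ** 2 / (2.0 * sigma)   # quadratic regime, as in k-means
    return r - sigma / 2.0              # linear regime, bounded influence

small = huber_loss([0.5], [0.0], sigma=1.0)
large = huber_loss([3.0], [0.0], sigma=1.0)
at_threshold = huber_loss([1.0], [0.0], sigma=1.0)
```

Both branches agree at ‖x − f(z)‖ = σ, so the loss is continuous, and beyond σ it grows only linearly in the residual.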


17.1.3 Examples with Infinite Codes 
Instead of discrete quantization, we can also consider a mapping onto a manifold 


of dimensionality lower than that of the input space. PCA (see also Sections 14.1 
and 14.4.1) can be viewed in the following way [231]:

Example 17.4 (Principal Components) Define Z := R, f : z ↦ f_0 + z · f_1 with f_0, f_1 ∈
X, ‖f_1‖ = 1, and F to be the set of all such line segments. Moreover, let c(x, f(z)) :=
‖x − f(z)‖². Then the minimizer of

\[
R[f] := \int_{\mathcal{X}} \min_{z \in [0,1]} \|x - f_0 - z \cdot f_1\|^2 \, dP(x)
\tag{17.9}
\]

over f ∈ F yields a line parallel to the direction of largest variance in P(x) (see [231] and
Section 14.1).
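
The following sketch illustrates the empirical version of this statement. It assumes NumPy, takes f_0 to be the sample mean, and lets the code z range over all of R rather than a bounded interval; the function names are hypothetical.

```python
import numpy as np

def first_principal_line(X):
    """Return (f0, f1): the sample mean and the leading principal direction."""
    f0 = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(X - f0, full_matrices=False)
    return f0, vt[0]

def line_quantization_error(X, f0, f1):
    """Mean squared distance of the samples to the line z -> f0 + z * f1."""
    centered = X - f0
    z = centered @ f1                        # optimal code: projection coefficient
    residual = centered - np.outer(z, f1)
    return (residual ** 2).sum(axis=1).mean()

# Synthetic data stretched along the direction u.
rng = np.random.default_rng(0)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
X = np.outer(rng.normal(0.0, 3.0, size=500), u) + 0.05 * rng.normal(size=(500, 2))
f0, f1 = first_principal_line(X)
err_principal = line_quantization_error(X, f0, f1)
err_orthogonal = line_quantization_error(X, f0, np.array([1.0, -1.0]) / np.sqrt(2.0))
```

The recovered direction `f1` is parallel to `u` up to sign, and the quantization error along it is smaller than along the orthogonal direction.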


A slight modification results in simultaneous diagonalization of the covariance 
matrix with respect to an additional metric tensor. 


Example 17.5 (Transformed Loss Metrics) Denote by D a strictly positive definite
matrix. With the definitions above and the loss function

\[
c(x, f(z)) := (x - f(z))^\top D^{-1} (x - f(z)),
\tag{17.10}
\]

the minimizer of the empirical quantization can be found by simultaneous diagonalization
of D and the covariance matrix cov(x).


This can be seen as follows: We replace x by x̃ := D^{-1/2} x and f by f̃ := D^{-1/2} f.
Now c(x, f(z)) = ‖x̃ − f̃(z)‖², hence we have reduced the problem to one of finding
principal components for the covariance matrix D^{-1/2} cov(x) D^{-1/2}. This, however, is
equivalent to simultaneous diagonalization of D and cov(x), which completes the
proof.

Further choices of c, such as the || - ||; metric or Huber’s robust loss function, lead 
to algorithms that are less prone to instabilities caused by outliers than standard 
PCA. 

A combination of k-means clustering and principal components leads to the k-planes
clustering algorithm proposed in [72] (also known as Local PCA by Kambhatla
& Leen [276, 277]).² Here, clustering is carried out with respect to k planes
instead of k cluster points. After assigning the data points to the planes, the latter 
are re-estimated using PCA (thus, the directions with smallest variance are elimi- 
nated). Both Kambhatla & Leen [277] and Bradley & Mangasarian [72] show that 
this can improve results on certain datasets. 

Hastie & Stuetzle [231] extended PCA in a different direction by allowing f(z) 
to be other than a linear function. 


2. While [277] introduces the problem by considering local linear versions of Principal 
Component Analysis and takes a Neural Networks perspective, [73] treats the task mainly 
as an optimization problem for which convergence to a local minimum in a finite number 
of steps is proven. While the resulting algorithm is identical, the motivation in the two cases 
differs significantly. In particular, the ansatz in [73] makes it easier for us to formulate the 
problem as one of minimizing a quantization functional. 

The original local linear Vector Quantization formulation put forward in [276] also allows 
us to give a quantization formulation for local PCA. To achieve this, we simply consider 
linear subspaces together with their enclosing Voronoi cells.

Figure 17.3 Projection of points x onto a manifold f(Z), in this case with minimum
Euclidean distance.


Example 17.6 (Principal Curves and Surfaces) We define Z := [0,1]^d (with d ∈ N
and d > 1 for principal surfaces), f : z ↦ f(z), with f ∈ F a class of continuous
X-valued functions (possibly with further restrictions), and again c(x, f(z)) :=
‖x − f(z)‖². The minimizer of

\[
R[f] := \int_{\mathcal{X}} \min_{z \in [0,1]^d} \|x - f(z)\|^2 \, dP(x)
\tag{17.11}
\]

is not well defined, unless F is a compact set. Moreover, the minimizer of Remp[f] is
generally not well defined either. In fact, it is an ill posed problem in the sense of Arsenin
and Tikhonov [538]. Until recently [292], no uniform convergence properties of Remp[f]
to R[f] could be stated.


Kégl et al. [292] modified the original principal curves algorithm in order to prove 
bounds on R[f] in terms of Remp[f] and to show that the resulting estimate is 
well defined. The changes imply a restriction of F to polygonal lines with a fixed 
number of knots, and most importantly, fixed length L.³

This is essentially equivalent to using a regularization operator. Instead of a
length constraint, which, as we will show in Section 17.2.2, corresponds to a particular
regularization operator, we now consider more general smoothness constraints
on the estimated curve f(x).


17.2 A Regularized Quantization Functional 


What we would essentially like to have are estimates that not only yield small 
expected quantization error but are smooth curves (or manifolds) as well. The 
latter property is independent of the parametrization of the curve. It is difficult 
to compute such a quantity in practice, however. An easier task is to provide 
a measure of the smoothness of f depending on the parametrization of f(z). A 
wide range of regularizers from supervised learning can readily be used for this 
purpose. As a side effect, we also obtain a smooth parametrization. 


3. In practice, Kégl et al. use a constraint on the angles of the polygonal curves, rather than
the actual length constraint, to achieve sample complexity rates on the training time of the
algorithm. For the uniform convergence part, however, the length constraint is used.


Table 17.1 Comparison of the three basic learning problems considered in this book:
supervised learning, quantization, and feature extraction. Note that the optimization
problems for supervised learning and quantization can also be transformed into the
problem of minimizing Remp[f] subject to the constraint Ω[f] ≤ Λ.

            | Supervised Learning            | Quantization                   | Feature Extraction
Data        | X = {x_1, ..., x_m} ⊂ 𝒳,       | X = {x_1, ..., x_m} ⊂ 𝒳        | X = {x_1, ..., x_m} ⊂ 𝒳
            | Y = {y_1, ..., y_m} ⊂ 𝒴        |                                |
Objective   | minimize Test Error            | minimize Coding Error          | max. Interestingness
            | R[f] = E[c(x, y, f(x))]        | R[f] = E[min_z c(x, f(z))]     | Q[f] = E[q(f(x))]
Typical     | c(x, y, f(x)) = (y − f(x))²    | c(x, f(z)) = ‖x − f(z)‖²       | q(f(x)) = f(x)²
Examples    | c(x, y, f(x)) = |y − f(x)|_ε   | c(x, f(z)) = ‖x − f(z)‖₁       | q(f(x)) = f(x)⁴
Loss        | Mismatch between               | Approximation                  | Anomality of
            | f(x) and y                     | of x by f(z(x))                | f(x) via q(f(x))
Empirical   | Training Error                 | Approximation Error            | Contrast
Quantity    | Remp[f] =                      | Remp[f] =                      | Q[f] =
            | (1/m) Σ_i c(x_i, y_i, f(x_i))  | (1/m) Σ_i min_z c(x_i, f(z))   | (1/m) Σ_i q(f(x_i))
Strategy    | Remp[f] + λΩ[f]                | Remp[f] + λΩ[f]                | Q[f] subj. to Ω[f] ≤ Λ


We now propose a variant of minimizing the empirical quantization functional,
which seeks hypotheses from certain classes of smooth curves, leads to an algorithm
that is readily implemented, and is amenable to the analysis of sample complexity
via uniform convergence techniques. We will make use of a regularized
version of the empirical quantization functional. Let

\[
R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \Omega[f],
\tag{17.12}
\]

where Ω[f] is a convex nonnegative regularization term, and λ > 0 is a trade-off
constant determining how much simple functions f should be favored over
functions with low empirical quantization error. We now consider some possible
choices of Ω. This setting is very similar to those of supervised learning (4.1) and
the feature extraction framework (14.35). Table 17.1 gives an overview. In all three
cases we have the following three step procedure:


(i) Start with a measure of optimality (expected risk, quantization error, criterion 
of interestingness of the estimate f on the data) with respect to a distribution P(x) 


(ii) Replace the integration over P(x) by a sum over samples drawn iid from P(x) 


(iii) To ensure numerical stability and guarantee smooth estimates, add a regularization
term (usually quadratic or linear)


17.2.1 Quadratic Regularizers


As we have seen several times (see Chapter 4), quadratic functionals [538] are a
very popular choice of regularizer, and in the present case we can make use of the
whole toolbox of regularization functionals, regularization operators, and Reproducing
Kernel Hilbert Spaces. This illustrates how the simple transformation of a
problem into a known framework can make life much easier, and it allows us to
assemble new algorithms from readily available building blocks.

In the present case, Ω[f] = ½‖f‖²_H, these building blocks are kernels, and we may
expand f as

\[
f(z) = f_0 + \sum_{i=1}^{M} \alpha_i k(z_i, z),
\quad \text{with } z_i \in \mathcal{Z},\ \alpha_i \in \mathcal{X}, \text{ and } k : \mathcal{Z}^2 \to \mathbb{R},
\tag{17.13}
\]

given previously chosen nodes z_1, ..., z_M (of which we take as many as we can
afford in terms of computational cost). It can be shown (see Problem 17.2) that
the back-projections of the observations x_i onto f are in fact the most suitable
expansion points.⁴ Consequently, the regularization term can be written as

\[
\|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{M} \langle \alpha_i, \alpha_j \rangle k(z_i, z_j).
\tag{17.14}
\]

This is the functional form of ‖f‖²_H needed to derive efficient algorithms.
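
The kernel expansion (17.13) and the norm (17.14) translate directly into matrix operations. The sketch below assumes a Gaussian RBF kernel and NumPy arrays of shape (M, d_z) for the nodes and (M, n) for the coefficients; these conventions and all function names are our own choices for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix k(a_i, b_j) for rows of A (p, d_z) and B (q, d_z)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def evaluate_f(z, nodes, alpha, f0=0.0, sigma=1.0):
    """f(z) = f0 + sum_i alpha_i k(z_i, z), cf. (17.13); one output row per row of z."""
    return f0 + gaussian_kernel(z, nodes, sigma) @ alpha

def rkhs_norm_squared(nodes, alpha, sigma=1.0):
    """sum_{i,j} <alpha_i, alpha_j> k(z_i, z_j), cf. (17.14)."""
    K = gaussian_kernel(nodes, nodes, sigma)
    return np.trace(alpha.T @ K @ alpha)

nodes = np.array([[0.0], [1.0]])       # two nodes in a one-dimensional code space
alpha = np.eye(2)                      # alpha_1 = (1, 0), alpha_2 = (0, 1)
values = evaluate_f(np.array([[0.0]]), nodes, alpha)
norm2 = rkhs_norm_squared(nodes, alpha)
```

Since the two coefficient vectors are orthogonal here, only the diagonal kernel entries k(z_i, z_i) = 1 contribute, and `norm2` equals 2.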
17.2.2 Examples of Regularization Operators 


In our first example, we consider the equivalence between principal curves with a 
length constraint and minimizing the regularized quantization functional. 


Example 17.7 (Regularizers with a Length Constraint) By choosing the differentiation
operator Υ := ∂_z, ‖Υf‖² becomes an integral over the squared “speed” of the curve.
Re-parametrizing f to constant speed leaves the empirical quantization error unchanged,
whereas the regularization term is minimized. This can be seen as follows: By construction,
∫_[0,1] ‖∂_z f(z)‖ dz does not depend on the (re-)parametrization. The variance, however, is
minimal for a constant function, hence ‖∂_z f(z)‖ has to be constant over the interval [0,1].
Thus, ‖Υf‖² equals the squared length L² of the curve at the solution.
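
The variance argument can be spelled out in one line. Writing s(z) for the speed of the curve and L for its length (the symbol s is introduced here only for this remark), the regularizer splits into the squared length plus the variance of the speed over [0,1]:

```latex
\[
s(z) := \|\partial_z f(z)\|, \qquad L = \int_0^1 s(z) \, dz,
\]
\[
\|\Upsilon f\|^2 = \int_0^1 s(z)^2 \, dz
= L^2 + \int_0^1 \bigl( s(z) - L \bigr)^2 \, dz \;\ge\; L^2 .
\]
```

The second term vanishes exactly when s(z) = L for (almost) all z, that is, for a constant-speed parametrization. Since L is invariant under re-parametrization, this is where the regularizer attains its minimum value L².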


Minimizing the sum of the empirical quantization error and a regularizer, however,
is equivalent to minimizing the empirical quantization error for a fixed value
of the regularization term (when λ is suitably adjusted).⁵ Hence the proposed algorithm
is equivalent to finding the optimal curve with a length constraint; in other
words, it is equivalent to the algorithm proposed in [292].


4. In practice, however, such expansions tend to become unstable during the optimization
procedure. Hence, a set of z_i chosen a priori, for instance on a grid, is the default choice.

5. The reasoning is not completely true for the case of a finite number of basis functions;
f cannot then be completely re-parametrized to constant speed. The basic properties still
hold, however, provided the number of kernels is sufficiently high.

Figure 17.4 Two-dimensional periodic structures in R³. Left: unit sphere, which can be
mapped to [0, π] × R/2π; right: toroid, which is generated from [0, 2π]².

As experimental and theoretical evidence from regression indicates, it may be 
beneficial to choose a kernel that also enforces higher degrees of smoothness in 
higher derivatives of the estimate. Hence, we could just as well use Gaussian RBF 
kernels (2.68), 

\[
k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).
\tag{17.15}
\]


This corresponds to a regularizer penalizing all derivatives simultaneously (see 
Section 4.4.2 for details, and Section 17.6 for experimental results with this kernel). 

The use of periodical kernels (see Section 4.4.4) has interesting consequences 
in the context of describing manifolds. Such kernels allow us to model circular 
structures in X, hence we can find good approximations to objects such as the 
surface of balls or “donut”-shaped distributions (see Figure 17.4). Nonetheless, 
we have to keep in mind that this requires the spatial connectivity structure to be 
known (which we may not always assume). 

The appealing property of this formulation is that it is completely independent 
of the dimensionality and of any particular structure of Z. 


17.2.3 Linear Programming Regularizers 


It may not always be desirable to find expansions of f = Σ_{i=1}^M α_i k(z_i, ·) in terms of
many basis functions k(z_i, ·). Instead, it is often better to obtain an estimate of f
with just a few basis functions (but usually of almost equal quality). This can be
achieved via a regularizer that enforces sparsity (see Section 4.9.2), for example by
setting Ω[f] := Σ_i |α_i|. For α_i ∈ R^d, we use

\[
\Omega[f] := \sum_{i=1}^{M} \|\alpha_i\|_1 = \sum_{i=1}^{M} \sum_{j=1}^{d} |\alpha_{ij}|
\tag{17.16}
\]

instead. It is shown in [509] that by using an argument similar to that in [505], this
setting also allows efficient capacity control.

In several cases, (17.16) may not be exactly what we are looking for. In particular,
even if α_ij were nonzero for only one dimension j, we would still have to evaluate
the corresponding kernel function. The regularizer

\[
\Omega[f] := \sum_{i=1}^{M} \|\alpha_i\|_\infty = \sum_{i=1}^{M} \max_{j \in [d]} |\alpha_{ij}|
\tag{17.17}
\]

overcomes this limitation. The emphasis in this instance is on the maximum
weight of α_ij for a given i. It is possible to show that regularizers of type (17.17)
can also be cast in a linear or quadratic programming setting, provided the loss
function c is only linear or quadratic (see Problem 17.4).


In this section, we present an algorithm that approximately minimizes Rreg[f] via
coordinate descent. We certainly do not claim it is the best algorithm for this 
task — our goal is simply to find an algorithm consistent with our framework 
(which is amenable to sample complexity theory), and which works in practice. 
Furthermore, commonly known training algorithms for the special cases in Section 
17.1.2, such as k-means algorithms and the k-planes algorithm of Section 17.1.3, are 
special cases of the algorithm we propose. 

In the following, we assume the data to be centered and therefore drop the term
f_0 in the kernel expansion (17.13) of f. This greatly simplifies the notation (the
extension is straightforward). We further assume, for the sake of practicability, that
the ansatz for f can be written in terms of a finite number of parameters α_1, ..., α_M
(see the representer theorem for regularized principal manifolds in Problem 17.2),
and that likewise the regularizer Ω[f] can also be expressed as a function of
α_1, ..., α_M. This allows us to rephrase the problem of minimizing the regularized
quantization functional as

\[
\min_{\substack{\{\alpha_1, \ldots, \alpha_M\} \subset \mathcal{X} \\ \{\zeta_1, \ldots, \zeta_m\} \subset \mathcal{Z}}}
\left[ \frac{1}{m} \sum_{i=1}^{m} c\bigl( x_i, f_{(\alpha_1, \ldots, \alpha_M)}(\zeta_i) \bigr)
+ \lambda \Omega(\alpha_1, \ldots, \alpha_M) \right].
\tag{17.18}
\]

This minimization is achieved in an iterative fashion by coordinate descent over
ζ and α, in a manner analogous to the EM (expectation maximization) algorithm
[135]. Recall that the aim of the latter procedure is to find the distribution P(x), or at
least the parameters θ of a distribution P_θ(x, l), where x are observations and l are
latent variables. Keeping θ fixed, we first accomplish the E-step by maximizing
P_θ(x, l) with respect to l; the M-step then consists of maximizing P_θ(x, l) with
respect to θ. These two steps are repeated until no further improvement can be
achieved.


6. Coordinate descent means that to minimize a function f(x_1, ..., x_n) of several (possibly
vector valued) variables x_1, ..., x_n, we minimize f only with respect to one variable at a
time while keeping the others fixed.

Likewise, to solve (17.18), we alternate between minimizing with respect to
{ζ_1, ..., ζ_m}, which is equivalent to the E-step (projection), and with respect to
{α_1, ..., α_M}, corresponding to the M-step (adaptation). This procedure is repeated
until convergence; or, in practice, until the regularized quantization functional
stops decreasing significantly. Let us now have a closer look at the individual
phases of the algorithm.


17.3.1 Projection 


For each i ∈ [m], choose ζ_i := argmin_{ζ∈Z} c(x_i, f(ζ)); thus for squared loss, ζ_i :=
argmin_{ζ∈Z} ‖x_i − f(ζ)‖². Clearly, for fixed α_i, the resulting ζ_i minimize the loss
term in (17.18), which itself is equal to Remp[f] for given α_i and X. Hence Rreg[f] is
decreased while keeping Ω[f] fixed (since the variables α_i do not change). In practice
we use standard low dimensional nonlinear function minimization algorithms
(see Section 6.2 and [423] for details and references) to achieve this goal.

The computational complexity is O(m · M) since the minimization step has to
be carried out for each sample separately. In addition, each function evaluation
(the number of which we assume to be approximately constant per minimization)
scales linearly with the number of basis functions M.
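
A simple way to illustrate the projection step is an exhaustive search over a finite set of candidate codes. This replaces the nonlinear minimizers mentioned above by a cruder technique and is only a sketch: the Gaussian kernel, the candidate grid, and the names `project` and `candidates` are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix k(a_i, b_j) for rows of A (p, d_z) and B (q, d_z)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def project(X, nodes, alpha, candidates, sigma=1.0):
    """For each x_i return the candidate zeta minimizing ||x_i - f(zeta)||^2."""
    F = gaussian_kernel(candidates, nodes, sigma) @ alpha      # f at all candidates
    d2 = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)    # (m, #candidates)
    best = d2.argmin(axis=1)
    return candidates[best], d2[np.arange(len(X)), best]

nodes = np.linspace(0.0, 1.0, 5)[:, None]                      # z_1, ..., z_M with M = 5
alpha = np.column_stack([np.linspace(-1.0, 1.0, 5),
                         np.linspace(1.0, -1.0, 5) ** 2])      # arbitrary coefficients
candidates = np.linspace(0.0, 1.0, 201)[:, None]
curve = gaussian_kernel(candidates, nodes, 0.3) @ alpha
X = curve[[20, 100, 180]]                                      # samples lying on the curve
zeta, losses = project(X, nodes, alpha, candidates, sigma=0.3)
```

Points that lie on the curve are projected with zero loss. In practice the grid result would serve as a starting point for a local minimizer.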


17.3.2 Adaptation 


Next, the parameters ζ_i are fixed, and α_i is adapted such that Rreg[f] decreases
further. The design of practical algorithms to decrease Rreg[f] is closely connected
with the particular forms taken by the loss function c(x, f(z)) and the regularizer
Ω[f]. We restrict ourselves to squared loss in this section (c(x, f(z)) = ‖x − f(z)‖²)
and to the quadratic or linear regularization terms described in Section 17.2. We
thus assume that f(z) = Σ_{i=1}^M α_i k(z_i, z) for some kernel k, which matches the regularization
operator Υ in the quadratic case.


Quadratic Regularizers The problem to be solved in this case is to minimize

\[
\frac{1}{m} \sum_{i=1}^{m} \Bigl\| x_i - \sum_{j=1}^{M} \alpha_j k(z_j, \zeta_i) \Bigr\|^2
+ \frac{\lambda}{2} \sum_{i,j=1}^{M} \langle \alpha_i, \alpha_j \rangle k(z_i, z_j)
\tag{17.19}
\]

with respect to α. This is equivalent to a multivariate regression problem where ζ_i
are the patterns and x_i the target values. Differentiation of (17.19) with respect to α_i
yields

\[
\Bigl( \frac{\lambda m}{2} K_z + K_\zeta^\top K_\zeta \Bigr) \alpha = K_\zeta^\top X,
\quad \text{and hence} \quad
\alpha = \Bigl( \frac{\lambda m}{2} K_z + K_\zeta^\top K_\zeta \Bigr)^{-1} K_\zeta^\top X,
\tag{17.20}
\]

where (K_z)_ij := k(z_i, z_j) is an M × M matrix and (K_ζ)_ij := k(ζ_i, z_j) is m × M. Moreover,
with slight abuse of notation, α and X denote the matrices of all parameters
and samples, respectively.

The computational complexity of the adaptation step is O(M² · m) for the matrix
computation, and O(M³) for the computation of the parameters α_i. Assuming termination
of the overall algorithm in a finite number of steps, the overall complexity
of the proposed algorithm is O(M³) + O(M² · m); thus, it scales linearly in the
number of samples (but cubic in the number of parameters).⁷
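
For squared loss and the quadratic regularizer, the adaptation step amounts to one linear solve. The sketch below assumes a Gaussian kernel and NumPy; it uses `np.linalg.solve` on the system in (17.20) rather than forming the inverse explicitly. The function names and the synthetic data are illustrative only.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix k(a_i, b_j) for rows of A (p, d_z) and B (q, d_z)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def adapt(X, zeta, nodes, lam, sigma=1.0):
    """Solve (lam * m / 2 * K_z + K_zeta^T K_zeta) alpha = K_zeta^T X, cf. (17.20)."""
    m = X.shape[0]
    K_z = gaussian_kernel(nodes, nodes, sigma)        # (M, M)
    K_zeta = gaussian_kernel(zeta, nodes, sigma)      # (m, M)
    A = 0.5 * lam * m * K_z + K_zeta.T @ K_zeta
    return np.linalg.solve(A, K_zeta.T @ X)

def regularized_objective(X, zeta, nodes, alpha, lam, sigma=1.0):
    """Objective (17.19): mean squared residual plus lam / 2 times the norm (17.14)."""
    K_z = gaussian_kernel(nodes, nodes, sigma)
    K_zeta = gaussian_kernel(zeta, nodes, sigma)
    loss = ((X - K_zeta @ alpha) ** 2).sum(axis=1).mean()
    return loss + 0.5 * lam * np.trace(alpha.T @ K_z @ alpha)

rng = np.random.default_rng(0)
nodes = np.linspace(0.0, 1.0, 5)[:, None]             # M = 5 nodes
zeta = rng.uniform(0.0, 1.0, size=(40, 1))            # fixed projections, m = 40
X = np.column_stack([np.cos(2.0 * zeta[:, 0]), np.sin(2.0 * zeta[:, 0])])
X = X + 0.01 * rng.normal(size=X.shape)
alpha = adapt(X, zeta, nodes, lam=0.1, sigma=0.3)
base = regularized_objective(X, zeta, nodes, alpha, lam=0.1, sigma=0.3)
```

Because (17.19) is a convex quadratic in α, any perturbation of the returned coefficients can only increase the objective.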


Linear Regularizers In this case, the adaptation step can be solved via a quadratic
optimization problem. The trick is to break up the ℓ₁ norms (that is, the city-block
metric) of the coefficient vectors α_i into pairs of nonnegative variables α_i − α_i*,
thus replacing ‖α_i‖₁ by ⟨α_i, 𝟏⟩ + ⟨α_i*, 𝟏⟩. Consequently we have to minimize

\[
\frac{1}{m} \sum_{i=1}^{m} \Bigl\| x_i - \sum_{j=1}^{M} (\alpha_j - \alpha_j^*) k(z_j, \zeta_i) \Bigr\|^2
+ \lambda \sum_{i=1}^{M} \langle \alpha_i + \alpha_i^*, \mathbf{1} \rangle,
\tag{17.21}
\]

with the constraint that α_i, α_i* belong to the positive orthant in X. Here 𝟏 denotes
the vector of ones in R^d. Optimization is carried out using standard quadratic
programming codes (see Chapters 6 and 10, and [380, 253, 556]). Depending on
the particular implementation of the algorithm, this has an order of complexity
similar to a matrix inversion; that is, the same number of calculations needed to
solve the unconstrained quadratic optimization problem described previously.


An algorithm alternating between the projection and adaptation step, as described 
above, generally decreases the regularized risk term and eventually converges to 
a local minimum of the optimization problem (see Problem 17.6). What remains is 
to find good starting values. 


17.3.3 Initialization 


The idea is to choose the coefficients α_i such that the initial guess of f approximately
points in the directions of the first D principal components given by the
matrix V := (v_1, ..., v_D). This is done in a manner analogous to the initialization
in the generative topographic mapping (eq. (2.20) of [52]);

\[
\min_{f \in \mathcal{F}} \frac{1}{M} \sum_{i=1}^{M}
c\bigl( V(z_i - z_0) - f_{\alpha_1, \ldots, \alpha_M}(z_i) \bigr) + \lambda \Omega[f].
\tag{17.22}
\]

Hence for squared loss and quadratic regularizers, α is given by the solution of
((λ/2) 𝟏 + K_z) α = V(Z − Z_0), where Z denotes the matrix of z_i, z_0 the mean of z_i, and
Z_0 the matrix of m identical copies of z_0. If we are not dealing with centered data
as assumed, f_0 is set to the sample mean, f_0 = (1/m) Σ_{i=1}^m x_i.


7. Note that the memory requirements are also at least O(M - m), and that for optimal 
performance M should increase with increasing m.


There exists a strong connection between Regularized Principal Manifolds (RPM)
and Generative Models, and in particular to the Generative Topographic Map 
(GTM). The main difference is that the latter attempts to estimate the density of the 
data, whereas the quantization functional is mainly concerned with approximat- 
ing the observations X. In a nutshell, the relation to Generative Models is similar 
to the connection between classification or regression estimates and Maximum a 
Posteriori estimates (see Chapter 16) in supervised learning: the prior probability 
over different generative models, in our case decoders such as manifolds and finite 
sets, plays the role of a regularization term. 


17.4.1 Generative Models 


Let us begin with the basic setting of a generative model⁸ in Bayesian estimation
(for the basic properties see Chapter 16). We present a simplified version
here. Denote by P_α a distribution parametrized by α, and by P(α) a prior probability
distribution over all possible values of α, which encodes our prior beliefs
as to which distributions P_α are more likely to occur. Then, by Bayes’ rule (Section
16.1.3), the posterior probability of a distribution P_α given the observations
X = {x_1, ..., x_m} ⊂ 𝒳 is given by

\[
P(\alpha | X) = \frac{P(X | \alpha) P(\alpha)}{P(X)}.
\tag{17.23}
\]

Since we cannot compute P(X), it is usually ignored, and later reintroduced as a
normalization factor. We exploit the iid assumption on X to obtain

\[
P(X | \alpha) = \prod_{i=1}^{m} P_\alpha(x_i).
\tag{17.24}
\]

Taking the log of (17.23) and substituting in (17.24) yields the log posterior probability,

\[
\ln P(\alpha | X) = \sum_{i=1}^{m} \ln P_\alpha(x_i) + \ln P(\alpha) + c.
\tag{17.25}
\]

Here c is the obligatory additive constant we got by ignoring P(X). As with supervised
learning, (17.25) is very similar to the regularized quantization functional, if
we formally identify ln P_α(x_i) with the negative quantization error incurred by encoding
x_i and − ln P(α) with the regularizer.


8. The term generative model is just a synonym for density model. The generative part comes 
from the fact that, given a density model, we can draw (=generate) data from this model, 
which will then be distributed similarly to the original data.

The difference lies in the form of P_α(x_i). Whereas the quantization functional
approach assumes optimal encoding via (17.1), the generative model approach
essentially assumes stochastic encoding via latent variables. For convenience, we
denote the latter by ζ. In general, P(x_i|α, ζ) takes on a relatively simple form, such
as a normal distribution around some f_α(ζ) with variance σ². In this case, we set

\[
P(x | \alpha, \zeta) = (2\pi\sigma^2)^{-n/2}
\exp\left( -\frac{\|x - f_\alpha(\zeta)\|^2}{2\sigma^2} \right).
\tag{17.26}
\]

Integration over P(ζ) yields

\[
P(x | \alpha) = \int P(x | \alpha, \zeta) \, dP(\zeta).
\tag{17.27}
\]

The problem is that P(ζ) itself is unknown and must be estimated as well. Furthermore,
we would like to find a P(ζ) such that P(X|α) is maximized. In fact, there
exists an algorithm to iteratively improve P(X|α), namely the EM algorithm [135] (we
already encountered a similar method in Section 17.3).

Without going into further detail (for the special case of the GTM, see [52, 51]),
the algorithm works by iteratively estimating P(ζ) via Bayes’ rule,

\[
P(\zeta | x, \alpha) = \frac{P(x | \zeta, \alpha) P(\zeta)}{\sum_{\zeta'} P(x | \zeta', \alpha) P(\zeta')},
\tag{17.28}
\]

and subsequently maximizing the log posterior under the assumption that P(ζ)
is fixed. This process is repeated until convergence occurs. The latent variables ζ
play a role similar to the projections back onto the manifold in the RPM algorithm.
Let us now have a closer look at an example: Generative Topographic Mappings.


17.4.2 The Generative Topographic Mapping 


The specific form of P(α), f_α, and P(x|α) in the GTM is as follows: P(x|α, ζ) is taken
from (17.26), and ζ is assumed to belong to a low dimensional grid; for instance,
ζ ∈ [1, p]^d, where d is the dimensionality of the manifold. Hence, a finite number
of “nodes” ζ_j may be “responsible” for having generated a particular data-point
x_i. This is done in order to render the integral over dζ and the computation of
P(ζ|x, α) practically feasible (see Figure 17.5). Moreover, f_α(ζ) is the usual kernel
expansion; in other words, for some ζ_i we have

\[
f_\alpha(\zeta) = \sum_{i=1}^{M} \alpha_i k(\zeta_i, \zeta).
\tag{17.29}
\]


Bishop et al. [52] choose Gaussian RBF kernels (17.15) for (17.29). 
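
With a finite grid of nodes and the Gaussian noise model (17.26), the posterior over nodes for each data point can be computed in closed form. The following sketch is our own illustration and not the GTM code of [52]: it takes the images f_α(ζ_j) of the nodes as a given array `F`, assumes a uniform P(ζ) unless a prior is supplied, and works in log space for numerical stability.

```python
import numpy as np

def responsibilities(X, F, sigma2, prior=None):
    """Posterior P(zeta_j | x_i, alpha) over a finite grid of latent nodes.

    X      : (m, n) data points
    F      : (p, n) images f_alpha(zeta_j) of the p grid nodes
    sigma2 : noise variance sigma^2 of the Gaussian model
    prior  : (p,) node probabilities P(zeta_j); uniform if None
    """
    p = F.shape[0]
    if prior is None:
        prior = np.full(p, 1.0 / p)
    d2 = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    # The Gaussian normalization constant is the same for all nodes and cancels.
    log_w = -d2 / (2.0 * sigma2) + np.log(prior)
    log_w -= log_w.max(axis=1, keepdims=True)          # avoid underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

F = np.array([[0.0, 0.0], [2.0, 0.0]])                 # images of two nodes
X = np.array([[0.0, 0.0], [1.0, 0.0]])                 # one point on a node, one halfway
R = responsibilities(X, F, sigma2=0.25)
```

Each row of `R` sums to one. The point halfway between the two node images is shared equally, whereas the RPM projection step would assign it to a single ζ.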
Finally, we need a prior over the class of mappings f_α. In the initial version [52],
a Gaussian prior over the weights α_i was chosen,

\[
P(\alpha) = \prod_{i=1}^{M} (2\pi\omega^2)^{-n/2}
\exp\left( -\frac{\|\alpha_i\|^2}{2\omega^2} \right).
\tag{17.30}
\]

Here ω² denotes the variance of the coefficients and n is the dimensionality of X and
α_i. Unfortunately, this setting depends heavily on the number of basis functions
k(ζ_i, ζ). Hence, in order to overcome this problem, [51] introduced a Gaussian
Process prior (see Section 16.3); thus

\[
P(\alpha) = (2\pi)^{-M/2} |K|^{1/2}
\exp\left( -\frac{1}{2} \sum_{i,j} \langle \alpha_i, \alpha_j \rangle k(\zeta_i, \zeta_j) \right).
\tag{17.31}
\]

Figure 17.5 The function f_α(ζ) describes the mapping from the space of latent variables ζ
into X. Around each f_α(ζ_i) we have a normal distribution with variance σ² (after [52]).


Due to the Representer Theorem (Th. 4.2), we only need as many basis functions
as we have ζ_i, and in particular, the centers may coincide (see Problem 17.9).

Many of the existing extensions in the regularization framework (e.g., semipara- 
metric settings) and recent developments in Bayesian methods (Relevance Vector 
Priors and Laplacian Priors) could also be applied to the GTM (see Chapter 4, Sec- 
tions 16.5 and 16.6, and Problems 17.10 and 17.11 for more details). 


17.4.3 Robust Coding and Regularized Quantization 


From a mere coding point of view, it might not seem too obvious at first glance 
that we need very smooth curves. In fact, one could construct a space-filling curve 
(see Figure 17.6). This would allow us to achieve zero empirical and expected 
quantization error, by exploiting the fact that codewords may be specified to 
arbitrary precision. The codebook in this setting would have to be exact, however, 
and the resulting estimate f would be quite useless for any practical purpose. 

The subsequent reasoning explains why such a solution f is also undesirable 
from a learning theory point of view. Let us modify the situation slightly and 
introduce a noisy channel; that is, the reconstruction does not occur for 


\[
\zeta(x) := \operatorname*{argmin}_{\zeta \in \mathcal{Z}} c(x, f(\zeta)),
\tag{17.32}
\]

but for the random variable ζ̂(x) with

\[
\hat{\zeta}(x) := \operatorname*{argmin}_{\zeta \in \mathcal{Z}} c(x, f(\zeta)) + \xi.
\tag{17.33}
\]

Here ξ is a symmetrically distributed random variable drawn according to P(ξ),

Figure 17.6 The Peano curve as an example of a space filling curve. We observe that with


increasing order, the maximum distance between any point and the curve decreases as 
9-41). 


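with zero mean and variance σ². In this case, we have to minimize a slightly
different risk functional given by

\[
R_{\mathrm{noise}}[f] := \int_{\mathcal{X} \times \mathbb{R}}
c\Bigl( x, f\bigl( \operatorname*{argmin}_{z \in \mathcal{Z}} c(x, f(z)) + \xi \bigr) \Bigr)
\, dP(x) \, dP(\xi).
\tag{17.34}
\]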


This modified setting rules out space-filling curves such as the Peano curve, since 
the deviation in the encoding could then vary significantly. Eq. (17.34) is inspired 
by the problem of robust vector quantization [195], and Bishop’s proof [50] that 
in supervised learning training with noise is equivalent to Tikhonov regulariza- 
tion. We use an adaptation of the techniques of [50] to derive a similar result in 
unsupervised learning. 

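Assume now that c(x, f(ζ)) is the squared loss ‖x − f(ζ)‖². If the overall influence
of ξ is small, the moments of order higher than two are essentially negligible,
and if f is twice differentiable, we may expand f as a Taylor expansion with
f(ζ + ξ) ≈ f(ζ) + ξ f′(ζ) + (ξ²/2) f″(ζ). Using the reasoning in [50], we arrive at

\[
\begin{aligned}
R_{\mathrm{noise}}[f] &\approx R[f] + 2 \int_{\mathbb{R}} \xi^2 \, dP(\xi)
\int_{\mathcal{X}} \Bigl( \|f'(\zeta)\|^2 + \tfrac{1}{2} \langle f(\zeta) - x, f''(\zeta) \rangle \Bigr) dP(x) \\
&= R[f] + 2\sigma^2 \int_{\mathcal{X}} \Bigl( \|f'(\zeta)\|^2
+ \tfrac{1}{2} \langle f(\zeta) - x, f''(\zeta) \rangle \Bigr) dP(x),
\end{aligned}
\tag{17.35}
\]

where ζ is defined as in (17.32). Finally we expand f at the unbiased solution f_0
(for which σ = 0) in terms of σ². Since the second term in (17.35) inside the integral
is O(σ²), its overall contribution is only O(σ⁴), and thus it can be neglected. What
remains is

\[
R_{\mathrm{noise}}[f] \approx R[f] + 2\sigma^2 \int_{\mathcal{X}} \|f'(\zeta)\|^2 \, dP(x)
\quad \text{with } \zeta = \zeta(x) = \operatorname*{argmin}_{\zeta \in \mathcal{Z}} \|x - f(\zeta)\|^2.
\tag{17.36}
\]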

Except for the fact that the integral is with respect to x (and hence with respect to some
complicated measure with respect to ζ), the second term is a regularizer enforcing
smoothness by penalizing the first derivative, as discussed in Section 4.4. Hence
we recover Principal Curves with a length constraint as a by-product of robust
coding.

We chose not to use the discrete sample size setting as in [50], since it appears
not to be very practicable to use a training-with-input-noise scheme as in super- 
vised learning for the problem of principal manifolds. The discretization of R[f], 
meaning its approximation by the empirical risk functional, is independent of this 
reasoning, however. It might be of practical interest, though, to use a probabilistic 
projection of samples onto the curve for algorithmic stability (as done for instance 
in simulated annealing for the k-means algorithm).


We next need a bound on the sample size sufficient to ensure that the above algorithm
finds an f close to the best possible; or, less ambitiously, to bring the empirical
quantization error Remp[f] close to the expected quantization error R[f]. This is
achieved by methods which are very similar to those in [292], and are based on uniform
(over a class of functions) convergence of empirical risk functionals to their
expected value. The basic probabilistic tools we need are given in Section 17.5.2. In
Section 17.5.3, we state bounds on the relevant covering numbers for the classes of
functions induced by our regularization operators,


$$\mathcal{F}_\Lambda := \{ f : \mathcal{Z} \to \mathcal{X} \mid Q[f] \le \Lambda \}. \tag{17.37}$$


Recall that $Q[f] = \frac{1}{2}\|\Upsilon f\|^2$, where $\|\Upsilon f\|^2$ is given by (17.14). Since bounding covering
numbers can be technically intricate, we only state the results and basic techniques
in the main body and relegate the proofs and more detailed considerations
to the appendix. Section 17.5.4 gives overall sample complexity rates. 

In order to avoid several technical requirements arising from unbounded loss 
functions (like boundedness of some moments of the distribution P(x) [559]), 
we assume that there exists some r > 0 such that the probability measure of
a ball of radius r is 1; that is, $P(U_r) = 1$. Kégl et al. [292] showed that under
these assumptions, the principal manifold f is also contained in $U_r$, hence the
quantization error is no larger than $e_c := \max_{x, x' \in U_r} c(x, x')$ for all x. For squared
loss we have $e_c = 4r^2$.


17.5.1 Metrics and Covering Numbers 


In order to derive bounds on the deviation between the empirical quantization 
error Remp[f] and the expected quantization error R[f] (in other words, to derive 
uniform convergence bounds), let us introduce the notion of a (bracket) $\epsilon$-cover
[415] of the loss function induced class

$$\mathcal{F}_c := \{ (x, z) \mapsto c(x, f(z)) \mid f \in \mathcal{F} \} \tag{17.38}$$

on $U_r$. A metric on $\mathcal{F}_c$ is defined by letting

$$d(f_c, f'_c) := \sup_{z \in \mathcal{Z},\, x \in U_r} |c(x, f(z)) - c(x, f'(z))|, \tag{17.39}$$

where $f, f' \in \mathcal{F}$. Whilst d is the metric we are interested in, it is quite hard to
directly compute covering numbers with respect to it. By an argument from [606, 14],
however, it is possible to upper bound these quantities in terms of corresponding
entropy numbers of the class of functions $\mathcal{F}$ itself if c is Lipschitz continuous.
Denote by $l_c > 0$ a constant for which $|c(x, x') - c(x, x'')| \le l_c \|x' - x''\|_2$ for all
$x, x', x'' \in U_r$. In this case,

$$d(f_c, f'_c) \le l_c \sup_{z \in \mathcal{Z}} \|f(z) - f'(z)\|_2, \tag{17.40}$$

hence all we have to do is compute the $L_\infty(\ell_2^d)$ covering numbers of $\mathcal{F}$ to obtain the
corresponding covering numbers of $\mathcal{F}_c$, with the norm on $\mathcal{F}$ defined as

$$\|f\|_{L_\infty(\ell_2^d)} := \sup_{z \in \mathcal{Z}} \|f(z)\|_{\ell_2^d}. \tag{17.41}$$


The metric is induced by the norm in the usual fashion. For the polynomial loss
$c(x, f(z)) := \|x - f(z)\|_2^p$, we obtain $l_c = p(2r)^{p-1}$. Given a metric $\rho$ and a set $\mathcal{F}$,
the $\epsilon$ covering number of $\mathcal{F}$, written $N(\epsilon, \mathcal{F}, \rho)$ (also $N_\epsilon$ wherever the dependency
is obvious), is the smallest number of $\rho$-balls of radius $\epsilon$ of which the union
contains $\mathcal{F}$. With the above definitions, we can see immediately that
$N(\epsilon, \mathcal{F}_c, d) \le N(\epsilon / l_c, \mathcal{F}, L_\infty(\ell_2^d))$.

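To make the definition of a covering number concrete, the following minimal sketch builds a greedy cover of a finite toy class of functions under the sup metric. It is only an illustration: the names `greedy_cover` and `sup_dist` and the toy class are assumptions, and the greedy construction (with centers restricted to the class itself) yields an upper bound on the covering number rather than its exact value.

```python
def greedy_cover(points, eps, dist):
    """Return centers such that every point lies within eps of some center.

    The number of centers is an upper bound on the eps covering number
    of the finite set `points` under the metric `dist`.
    """
    centers = []
    for p in points:
        # Open a new ball only if p is not yet covered by an existing center.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def sup_dist(f, g):
    """Sup metric between two functions sampled on the same latent grid."""
    return max(abs(a - b) for a, b in zip(f, g))


# Toy class (hypothetical): constant functions with values 0..10,
# each sampled at three latent points.
functions = [(float(v),) * 3 for v in range(11)]
centers = greedy_cover(functions, 1.0, sup_dist)
```

For this toy class the greedy procedure selects every second function, so six balls of radius 1 suffice; a smarter choice of centers could use fewer, which is why the result is only an upper bound.
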
17.5.2 Upper and Lower Bounds 


The next two results are similar in their flavor to the bounds obtained in [292]. 
They are slightly streamlined since they are independent of certain technical con- 
ditions on F used in [292]. 


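Proposition 17.8 ($L_\infty(\ell_2^d)$ bounds for Principal Manifolds) Denote by $\mathcal{F}$ a class of
continuous functions from $\mathcal{Z}$ into $\mathcal{X} \subseteq U_r$, and let P(x) be a distribution over $\mathcal{X}$. If m
points are drawn iid from P(x), then for all $\eta > 0$, $\epsilon \in (0, \eta/2)$,

$$P\left\{ \sup_{f \in \mathcal{F}} \left| R_{\text{emp}}[f] - R[f] \right| > \eta \right\} \le 2 N\left( \tfrac{\epsilon}{2 l_c}, \mathcal{F}, L_\infty(\ell_2^d) \right) e^{-2m(\eta - \epsilon)^2 / e_c}. \tag{17.42}$$
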
Proof By the definition of $R_{\text{emp}}[f] = \frac{1}{m} \sum_{i=1}^m \min_z \|f(z) - x_i\|^2$, the empirical quantization
functional is an average over m iid random variables that are each bounded
by $e_c$. Hence we may apply Hoeffding's inequality (Theorem 5.1) to obtain

$$P\left\{ \left| R_{\text{emp}}[f] - R[f] \right| > \eta \right\} \le 2 e^{-2m\eta^2 / e_c}. \tag{17.43}$$

The next step is to discretize $\mathcal{F}_c$ by an $\frac{\epsilon}{2}$ cover (that is, $\mathcal{F}$ by an $\frac{\epsilon}{2 l_c}$ cover) with
respect to the metric d: for every $f_c \in \mathcal{F}_c$ there exists some $f_i$ in the cover such
that $|R[f] - R[f_i]| \le \frac{\epsilon}{2}$ and $|R_{\text{emp}}[f] - R_{\text{emp}}[f_i]| \le \frac{\epsilon}{2}$. Consequently,

$$P\left\{ \left| R_{\text{emp}}[f] - R[f] \right| > \eta \right\} \le P\left\{ \left| R_{\text{emp}}[f_i] - R[f_i] \right| > \eta - \epsilon \right\}. \tag{17.44}$$

Substituting (17.44) into (17.43) and taking the union bound over the $\frac{\epsilon}{2}$ cover of $\mathcal{F}_c$
gives the desired result. ∎


This result is useful to assess the quality of an empirically found manifold. In order 
to obtain rates of convergence, we also need a result connecting the expected 
quantization error of the principal manifold $f^*_{\text{emp}}$ minimizing $R_{\text{emp}}[f]$ and the
manifold f* with minimal quantization error R[f*]. 


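Proposition 17.9 (Rates of Convergence for Optimal Estimates) Suppose $\mathcal{F}$ is compact
(thus $f^*_{\text{emp}}$ and $f^*$ exist as defined). With the definitions of Proposition 17.8,

$$P\left\{ \sup_{f \in \mathcal{F}} \left| R[f^*_{\text{emp}}] - R[f^*] \right| > \eta \right\} \le 2 \left( N\left( \tfrac{\epsilon}{l_c}, \mathcal{F}, L_\infty(\ell_2^d) \right) + 1 \right) e^{-\frac{m(\eta - \epsilon)^2}{2 e_c}}. \tag{17.45}$$

The proof is similar to that of Proposition 17.8, and can be found in Section A.2.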


17.5.3 Bounding Covering Numbers 


Following Propositions 17.8 and 17.9, the missing ingredient in the uniform convergence
bounds is a bound on the covering number $N(\epsilon, \mathcal{F})$.

Before going into detail, let us briefly review what already exists in terms of
bounds on the covering number N for $L_\infty(\ell_2^d)$ metrics. Kégl et al. [292] essentially
show that

$$\ln N(\epsilon, \mathcal{F}) = O\left(\tfrac{1}{\epsilon}\right) \tag{17.46}$$

under the following assumptions: They consider polygonal curves $f(\cdot)$ of length
L in the ball $U_r \subset \mathcal{X}$. The distance measure (no metric!) for $N(\epsilon)$ is defined as
$\sup_{x \in U_r} |\Delta(x, f) - \Delta(x, f')| \le \epsilon$. Here $\Delta(x, f)$ is the minimum distance between a
curve $f(\cdot)$ and $x \in U_r$.

By using functional analytic tools developed in [606] (see Chapter 12) we can 
obtain results for more general regularization operators, which can then be used 
in place of (17.46) to obtain bounds on the expected quantization error. 

While it is not essential to the understanding of the main results to introduce 
entropy numbers directly (they are essentially the functional inverse of the covering
numbers $N(\epsilon, \mathcal{F})$, and are dealt with in more detail in Chapter 12 and [509]),
we need to define ways of characterizing the simplicity of the class of functions 
via the regularization term under consideration. 

From Mercer’s Theorem (Theorem 2.10), we know that every kernel may be 
written as a dot product in some feature space, 


$$k(x, x') = \sum_i \lambda_i \psi_i(x) \psi_i(x'). \tag{17.47}$$

The eigenvalues $\lambda_i$ determine the shape of the data mapped into feature space (cf.
Figure 2.3). Roughly speaking, if the $\lambda_i$ decay rapidly, the possibly infinite expansion
in (17.47) can be approximated with high precision by a low-dimensional
space, which means that we are effectively dealing only with simple functions.

Recall that in Section 12.4, and specifically in Section 12.4.6, we stated that for a
Mercer kernel with $\lambda_j = O(e^{-\alpha j^p})$ for some $\alpha, p > 0$,

$$\ln N(\epsilon, \mathcal{F}) = O\left( \log^{\frac{p+1}{p}} \tfrac{1}{\epsilon} \right). \tag{17.48}$$

Moreover if k is a Mercer kernel with $\lambda_j = O(j^{-\alpha - 1})$ for some $\alpha > 0$, then

$$\ln N(\epsilon, \mathcal{F}) = O\left( \epsilon^{-\frac{\alpha}{2} + \delta} \right) \tag{17.49}$$

for any $\delta \in (0, \alpha/2)$. The rates obtained in (17.48) and (17.49) are quite strong. In
particular, recall that for compact sets in finite dimensional spaces of dimension
d, the covering number is $N(\epsilon, \mathcal{F}) = O(\epsilon^{-d})$ (see Problem 17.7 and [90]). In view
of (17.48), this means that even though we are dealing with a nonparametric
estimator, it behaves almost as if it were a finite dimensional estimator.

All that is left is to substitute (17.48) and (17.49) into the uniform convergence 
results to obtain bounds on the performance of our learning algorithm. Due to the 
slow growth in $N(\epsilon, \mathcal{F})$, we are able to prove good rates of convergence below.


17.5.4 Rates of Convergence 


Another property of interest is the sample complexity of learning Principal Manifolds.
Kégl et al. [292] showed an $O(m^{-1/3})$ rate of convergence for principal curves
(d = 1) with a length constraint regularizer. We prove that by utilizing a more powerful
regularizer (as is possible using our algorithm), we may obtain a bound of
the form $O(m^{-\frac{\alpha}{2(\alpha + 1)}})$ for polynomial rates of decay of the eigenvalues of k ($\alpha + 1$ is
the rate of decay), or $O(m^{-1/2 + \beta})$ for exponential rates of decay ($\beta$ is an arbitrary
positive constant). It would be surprising if we could do any better, given that supervised
learning rates are typically no better than $O(m^{-1/2})$. In the following, we
assume that $\mathcal{F}_\Lambda$ is compact; this is true of all the specific $\mathcal{F}_\Lambda$ considered above.


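Proposition 17.10 (Learning Rates for Principal Manifolds) Suppose $\mathcal{F}_\Lambda$ is compact.
Define $f^*_{\text{emp}}, f^* \in \mathcal{F}_\Lambda$ as in Proposition 17.9.

1. If $\ln N(\epsilon, \mathcal{F}_c, d) = O(\ln^\alpha \frac{1}{\epsilon})$ for some $\alpha > 0$, then

$$R[f^*_{\text{emp}}] - R[f^*] = O(m^{-1/2} \ln^{\alpha/2} m) = O(m^{-1/2 + \beta}) \tag{17.50}$$

for any $\beta > 0$.

2. If $\ln N(\epsilon, \mathcal{F}_c, d) = O(\epsilon^{-\alpha})$ for some $\alpha > 0$, then

$$R[f^*_{\text{emp}}] - R[f^*] \le O(m^{-\frac{1}{\alpha + 2}}). \tag{17.51}$$
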

The proof can be found in Section A.2. A restatement of the optimal learning rates 
in terms of the spectrum of the kernel leads to the following corollary: 
Corollary 17.11 (Learning Rates for given Spectra) Suppose that $\mathcal{F}_\Lambda$ is compact,
$f^*_{\text{emp}}, f^* \in \mathcal{F}_\Lambda$ are as before, and $\lambda_j$ are the eigenvalues of the kernel k inducing $\mathcal{F}_\Lambda$ (sorted
in decreasing order). If $\lambda_j = O(e^{-\alpha j^p})$, then

$$R[f^*_{\text{emp}}] - R[f^*] \le O\left( m^{-1/2} \ln^{\frac{p+1}{2p}} m \right). \tag{17.52}$$

If $\lambda_j = O(j^{-\alpha})$ for quadratic regularizers, or $\lambda_j = O(j^{-\alpha/2})$ for linear regularizers, then

$$R[f^*_{\text{emp}}] - R[f^*] \le O\left( m^{-\frac{\alpha - 1}{2\alpha}} \right). \tag{17.53}$$


Interestingly, the above result is slightly weaker than the result in [292] for the 
case of length constraints, as the latter corresponds to the differentiation operator, 
thus to polynomial eigenvalue decay of order 2, and therefore to a rate of 1/4 (Kégl
et al. [292] obtain 1/3). For a linear regularizer, though, we obtain a rate of 3/8. It is
unclear whether this is due to our bound on the entropy numbers induced by k
(possibly) not being optimal, or the fact that our results are stated in terms of the
(stronger) $L_\infty(\ell_2^d)$ metric. This weakness, yet to be fully understood, should not
detract from the fact that we can get better rates by using stronger regularizers, 
and our algorithm can utilize such regularizers. 
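
The rates quoted above can be checked with a few lines of exact arithmetic. The sketch below assumes that the exponent in (17.51) reads 1/(α + 2) and the exponent in (17.53) reads (α − 1)/(2α); the function names are hypothetical and serve only to illustrate the bookkeeping.

```python
from fractions import Fraction


def rate_from_entropy_exponent(a):
    """Exponent r in O(m^-r) when ln N = O(eps^-a), assuming the form of (17.51)."""
    return 1 / (Fraction(a) + 2)


def rate_from_spectrum(alpha):
    """Exponent r in O(m^-r) for eigenvalue decay with parameter alpha, assuming (17.53)."""
    alpha = Fraction(alpha)
    return (alpha - 1) / (2 * alpha)


# Quadratic regularizer, decay of order 2: lambda_j = O(j^-alpha) with alpha = 2.
quadratic = rate_from_spectrum(2)
# Linear regularizer, decay of order 2: lambda_j = O(j^-(alpha/2)), hence alpha = 4.
linear = rate_from_spectrum(4)
```

Under these assumptions `quadratic` evaluates to 1/4 and `linear` to 3/8, which can be compared with the rate 1/3 reported in [292] for the length constraint.
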
In order to show that the algorithm proposed in section 17.3 is sound, we ran
several experiments (cf. Figure 17.7, 17.9). In all cases, Gaussian RBF kernels (2.68) 
were used. First, we generated different data sets in 2 and 3 dimensions from 
1 or 2 dimensional parametrizations. We then applied our algorithm, using the 
prior knowledge about the original parametrization dimension of the data set in 
choosing the size of the latent variable space. For almost any parameter setting 
(comprising our choice of $\lambda$, M, and the width of the basis functions) we obtained
good results, which means that the parametrization is well behaved. 

We found that for a suitable choice of the regularization factor $\lambda$, a very close
match to the original distribution can be achieved. Although the number and 
width of the basis functions also affect the solution, their influence on its basic 
characteristics is quite small. Figure 17.8 shows the convergence properties of the 
algorithm. We observe that the overall regularized quantization error clearly decreases
for each step, while both the regularization term and the quantization error
term are free to vary. This empirically demonstrates that the algorithm strictly decreases
$R_{\text{reg}}[f]$ at every step, and eventually converges to a (local) minimum.⁹


9. $R_{\text{reg}}[f]$ is bounded from below by 0, hence any decreasing series of $R_{\text{reg}}[f_i]$, where $f_i$
denotes the estimate at step i, has a limit that is either a global or (more likely) a local
minimum. Note that this does not guarantee we will reach the minimum in a finite number
of steps.
Figure 17.7 Upper 4 images: We generated a dataset (small dots) by adding noise to a
distribution depicted by the dotted line. The resulting manifold generated by our approach
is given by the solid line (over a parameter range of $\mathcal{Z} = [-1, 1]$). From left to right, the
results are plotted for the regularization parameter values $\lambda = 0.1, 0.5, 1, 4$. The width
and number of basis functions were held constant at 1 and 10, respectively. Lower 4
images: We generated a dataset by sampling (with noise) from a distribution depicted in
the leftmost image (the small dots are the sampled data). The remaining three images show
the manifold obtained by our approach, for $\lambda = 0.001, 0.1, 1$, plotted over the parameter
space $\mathcal{Z} = [-1, 1]^2$. The width and number of basis functions were again constant (at 1 and
36, respectively).
Figure 17.8 Left: regularization term, middle: empirical quantization error, right: regularized
quantization error vs. number of iterations.


Given the close relationship to the GTM, we also applied Regularized Principal 
Manifolds to the oil flow data set used in [52]. This data set consists of 1000 
samples from $\mathbb{R}^{12}$, organized into 3 classes. The goal is to visualize these samples,
so we chose the latent space to be $\mathcal{Z} = [-1, 1]^2$ (with the exception of Figure 17.10,
where we embedded the data in 3 dimensions). We then generated the principal 
manifold and plotted the distribution of the latent variables for each sample (see 
Figure 17.9). For comparison purposes, the same strategy was applied to principal 
component analysis (PCA). The result achieved using principal manifolds reveals 
much more of the structure intrinsic to the data set than a simple search for 
directions with high variance. The algorithm output is competitive with [52]. 
Figure 17.9 Organization of the latent variable space for the oil flow data set. The left
hand plot depicts the result obtained using principal manifolds, with 49 nodes, kernel
width 1, and regularization 0.01. The right hand plot represents the output of principal
component analysis. The lower dimensional representation found by principal manifolds
nicely reveals the class structure, to a degree comparable to the GTM. Linear PCA fails
completely.


17.7 Summary 


Quantization 
Framework 


Many data descriptive algorithms, such as k-means clustering, PCA, and principal 
curves, can be seen as special instances of a quantization framework. Learning 
is then perceived in terms of being able to represent (in our case, this means to 
compress) data by a simple code, be it discrete or not. 

In deriving a feasible kernel based algorithm for this task, we first showed that 
minimizing the quantization error is an ill-posed problem, and thus requires ad- 
ditional regularization. This led to the introduction of regularized quantization 
functionals that can be solved efficiently in practice. Through the use of manifolds 
as a means of encoding, we obtained a new estimator: regularized principal man- 
ifolds. 

The expansion in terms of kernel functions and the treatment by regularization 
operators made it easier to decouple the algorithmic part (finding a suitable man- 
ifold) from the specification of a class of manifolds with desirable properties. In 
particular, the algorithm does not crucially depend on the number of nodes used. 

Bounds on the sample complexity of learning principal manifolds were given. 
Their proofs made use of concepts from regularization theory and supervised 
learning. More details on bounds involving entropy and covering numbers can 
be found in Chapter 12 and [509]. 

There are several directions for future work using the quantization functional 
approach; we mention the most obvious three. 
Figure 17.10 Organization of the latent variable space for the oil flow data set
using principal manifolds in 3 dimensions, with $6^3 = 216$ nodes, kernel width
1, and regularization $\lambda = 0.01$. The 3-dimensional latent variable space is projected
onto 2 dimensions for the purpose of visualization. Note the good separation
between the different flow regimes. The map further suggests that there exist
5 subdivisions of the regime labelled +.


• The algorithm could be improved. In contrast to successful kernel algorithms,
such as SVMs, the algorithm presented is not guaranteed to find a global mini- 
mum. Is it possible to develop an efficient algorithm that does? 


• The algorithm is related to methods that carry out a probabilistic assignment
of the observed data to the manifold (see Section 17.4.2). Such a strategy often 
exhibits improved numerical properties, and the assignments themselves can be 
interpreted statistically. It would be interesting to exploit this fact with RPMs. 

• Finally, the theoretical bounds could be improved, hopefully achieving the
same rate as in [292] for the special case addressed therein, while still keeping the 
better rates for more powerful regularizers. 


17.8 Problems 


17.1 (Sample Mean •) Show that the sample mean is indeed the minimizer of the empirical
quantization functional as defined in Example 17.1.


17.2 (Representer Theorem for RPMs ••) Prove that for a regularized quantization
functional of the form

$$R_{\text{reg}}[f] = \sum_{i=1}^m c(x_i, f(z_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \tag{17.54}$$

with $z_i := \operatorname*{argmin}_{z \in \mathcal{Z}} c(x_i, f(z))$, the function at the minimum of (17.54) is given by

$$f(z) = \sum_{i=1}^m \alpha_i k(z_i, z) \quad \text{where } \alpha_i \in \mathcal{X}. \tag{17.55}$$

Hint: consider the proof of Theorem 4.2. Now assume that there exists an optimal expansion
that is different from (17.55). Decompose f into $f_\parallel$ and $f_\perp$, and show that $f_\perp = 0$.


17.3 (Estimating R[f] •••) Denote by $R_{\text{LOO}}[f]$ the leave-one-out error (see also Section 12.2)
obtained for regularized quantization functionals. Show that this is an unbiased
estimator (see Definition 3.8) for R[f].

Discuss the computational problems in obtaining $R_{\text{LOO}}[f]$. Can you find a cheap approximation
for $R_{\text{LOO}}[f]$? Hint: remove one pattern at a time, keep the assignment $\zeta_i$ for
the other observations fixed, and modify the encoding.

Can you find a sufficient statistic for k-means clustering with squared loss error? Hint: 
express the leave-one-out approximation in closed form and compare it to the inter-cluster 
variance. What do you have to change for absolute deviations rather than squared errors? 


17.4 (Mixed Linear Programming Regularizers ••) Show that minimizing the regularized
risk functional $R_{\text{reg}}[f]$ with Q[f] chosen according to (17.17) can be written as a
quadratic program. Hint: use a standard trick from optimization: replace $\max(\xi_1, \ldots, \xi_n)$
by the set of inequalities $\xi \ge \xi_i$ for $i \in [n]$, and use $\xi$ in the objective function.

How does the regularized quantization functional look in the case of regularized princi- 
pal manifolds? 


17.5 (Pearls on a Chain ••) Denote by $f : [k] \to \mathcal{X}$ a mapping from k numbers to "cluster
centers" in $\mathcal{X}$, where the ith cluster is given by $f(i) = \sum_j \alpha_j k(j, i)$ and $z_i \in \mathbb{R}$. Can
you find a regularized quantization functional for clustering? Hint: use a quadratic regularizer.

How would you find a suitable algorithm to minimize $R_{\text{reg}}[f]$? What is the assumption
made about f when using such a regularizer? Can you find analogous settings for “nets” 
rather than “chains”? Can you modify the regularizer such that the chain becomes more 
and more “stiff” towards the end, thus effectively controlling the capacity? 


17.6 (Coordinate Descent •) Denote by $f : \mathbb{R}^N \to \mathbb{R}$ a multivariate function. Prove
that coordinate descent strictly decreases f at each step. Find a case where this strategy 
can be very slow. Find cases where it converges only to a local minimum. Under what 
circumstances is coordinate descent fast (hint: use a quadratic function in two variables as 
a toy example)? 


17.7 (Covering Numbers in Compact Sets of $\mathbb{R}^d$ [90] ••) Prove that in d-dimensional
compact sets S for any metric, the covering number $N(\epsilon, S)$ is bounded by $O(\epsilon^{-d})$. Hint:
compute the volume of the unit ball under the metric, and divide vol S by the volume of 
the unit ball. Exploit scaling properties of volumes. 


17.8 (Convergence Bounds without Bracket Covers ◦◦◦) Prove uniform convergence
bounds that do not require bracket covers, but instead make use of the fact that we are only 
comparing manifolds at a finite number of points. 


17.9 (Nodes for the GTM with GP Prior ••) Assume we use the Generative Topographic
Mapping algorithm with a Gaussian Process Prior on f and a discrete set for
the possible values of $\zeta_i$. Show that the minimum of a Maximum a Posteriori estimate is
achieved if $\zeta(x_i) = \zeta_i$. Hint: apply the representer theorem of Problem 17.2.
17.10 (Semiparametric Manifolds ••) Derive optimization equations for a semiparametric
regularization functional Q[f] (Section 4.8) in the case of Regularized Principal
Manifolds. Can you construct an estimator that smoothly blends over from Principal Com- 
ponent Analysis to nonlinear settings, i.e., that depends smoothly on a parameter such that 
for one value, one recovers PCA and for another one RPM. Hint: use hyperplanes in X as 
the parametric part. 

Can you apply the same reasoning to the GTM? Caution: you may obtain improper 
(that is, not normalizable) priors (•••).


17.11 (Laplacian Priors in the GTM •••) Formulate the posterior probability for a
Generative Topographic Map, where we use $\exp(-\|Q\|_1)$ as the prior probability rather
than a Gaussian Process prior on the weights. Can you derive the EM equations? 
Using a kernel k instead of a dot product in the input space $\mathcal{X}$ corresponds to
mapping the data into a dot product space $\mathcal{H}$ with a map $\Phi : \mathcal{X} \to \mathcal{H}$, and taking
the dot product there (cf. Chapter 2):

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle. \tag{18.1}$$

This way, all computations in $\mathcal{H}$ are done implicitly. The price paid for this elegance,
however, is that the solutions to kernel algorithms are only obtained as
expansions in terms of input patterns mapped into feature space. For instance, the
normal vector of an SV hyperplane is expanded in terms of Support Vectors, just
as the Kernel PCA feature extractors are expressed in terms of training examples:

$$\Psi = \sum_{i=1}^m \alpha_i \Phi(x_i). \tag{18.2}$$

When evaluating an SV decision function or a Kernel PCA feature extractor,
this is normally not a problem: thanks to (18.1), taking the dot product between
$\Psi$ and some mapped test point $\Phi(x)$ transforms (18.2) into a kernel expansion
$\sum_i \alpha_i k(x_i, x)$, which can be evaluated even if $\Psi$ lives in an infinite-dimensional
space. In some cases, however, a more comprehensive understanding is required
of the exact connection between patterns in input space and elements of feature 
space, given as expansions such as (18.2). This field is far from being understood, 
and the current chapter, which partly follows [474], attempts to gather some ideas 
elucidating the problem, and describes algorithms for situations where the above 
connection is important. 
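
As a small illustration of how the dot product between an expansion of the form (18.2) and a mapped test point is evaluated purely through the kernel, consider the following sketch. The Gaussian kernel, the coefficient values, and the function names are assumptions chosen for the example; nothing here is specific to a particular algorithm from this book.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def expansion_dot(alphas, xs, x, kernel=gaussian_kernel):
    """Evaluate <Psi, Phi(x)> for Psi = sum_i alpha_i Phi(x_i) via the kernel."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))


# Hypothetical two-term expansion in a 2-dimensional input space.
alphas = [0.5, -1.0]
xs = [(0.0, 0.0), (1.0, 0.0)]
value = expansion_dot(alphas, xs, (0.0, 0.0))
```

The feature map itself is never computed; only kernel evaluations between the expansion points and the test point are needed.
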

We start by stating the pre-image problem. By this we refer to the problem 
of finding patterns in input space that map to specific vectors in feature space 
(Section 18.1). This has applications for instance in denoising by Kernel PCA 
(Section 18.2). In Section 18.3, we build on the methods for computing single pre- 
images and construct so-called reduced set expansions, which approximate feature 
space vectors. We distinguish between methods that construct the expansions by 
selecting from the training set (Section 18.4), and methods that come up with 
synthetic expansion patterns (Section 18.5). When applied to the solution vector 
of an SVM, these methods can lead to significant increases in speed, which are 
crucial for making SVMs competitive in tasks where speed on the test sets is a 
major concern, such as face detection (Section 18.6). 
[Chapter overview diagram: 18.1 Pre-Images; 18.2 Finding Approximate Pre-Images; 18.3 Reduced Sets; 18.4 Reduced Set Selection; 18.5 Reduced Set Construction; 18.6 Sequential Reduced Set Evaluation]


The present chapter provides methods that are used in a number of kernel al- 
gorithms; background knowledge on the latter is therefore of benefit. Specifically, 
knowledge of SV classification is required in Sections 18.4 and 18.6, and parts of 
Section 18.4 additionally require knowledge of SV regression and of the basics of 
quadratic programming (cf. Chapter 6). The reader should also be familiar with 
the Kernel PCA algorithm (Section 14.2), which is used both as a tool and a subject 
of study in this chapter: on one hand, it is used for constructing approximation 
methods; on the other hand, it forms the basis of denoising methods, in which it is
combined with pre-image construction methods.


18.1 The Pre-Image Problem 


In Chapter 2, we introduced kernels and described several ways to construct 
the feature map associated with a given kernel; the latter represent ways to get 
from input space to feature space. We now study maps that work in the opposite 
direction. 

There has been a fair amount of work on aspects of this problem in the context 
of reduced set (RS) methods [84, 87, 184, 400, 474]. For pedagogical reasons, we 
postpone RS methods to Section 18.3, as they focus on a problem that is more 
complex than the one we would like to start with. 


18.1.1 Exact Pre-Images 


Kernel algorithms express their solutions as expansions in terms of mapped input
points (18.2). Since the map $\Phi$ into the feature space $\mathcal{H}$ is nonlinear, however, we
cannot generally assert that each such expansion has a pre-image under $\Phi$; namely
a point $z \in \mathcal{X}$ such that $\Phi(z) = \Psi$ (Figure 18.1). If the pre-image exists, then it will
be easy to compute, as shown by the following result:


Figure 18.1 The pre-image problem: points in the span of the mapped input data are
not all necessarily images of corresponding input patterns. Therefore, points that can be
written as expansions in terms of mapped input patterns (such as a Kernel PCA eigenvector
or an SVM hyperplane normal vector) cannot necessarily be expressed as images of single
input patterns. [Diagram labels: input space $\mathcal{X}$, map $\Phi$, $\Phi(\mathcal{X})$, $\mathcal{H} = \operatorname{span} \Phi(\mathcal{X})$, $\Psi$.]


Proposition 18.1 (Exact Pre-Images [467]) Consider a feature space expansion
$\Psi = \sum_{j=1}^m \alpha_j \Phi(x_j)$, where $x_j \in \mathcal{X}$. We assume that $\mathcal{X}$ is a subset of $\mathbb{R}^N$. If there exists $z \in \mathbb{R}^N$
such that

$$\Phi(z) = \Psi, \tag{18.3}$$

and an invertible function $f_k$ such that $k(x, x') = f_k(\langle x, x' \rangle)$, then we can compute z as

$$z = \sum_{i=1}^N f_k^{-1}\left( \sum_{j=1}^m \alpha_j k(x_j, e_i) \right) e_i, \tag{18.4}$$

where $\{e_1, \ldots, e_N\}$ is any orthonormal basis of input space.


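Proof We expand z as

$$z = \sum_{i=1}^N \langle z, e_i \rangle e_i = \sum_{i=1}^N f_k^{-1}(k(z, e_i))\, e_i = \sum_{i=1}^N f_k^{-1}\left( \sum_{j=1}^m \alpha_j k(x_j, e_i) \right) e_i. \tag{18.5}$$
∎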


This proposition gives rise to a number of observations. First, examples of kernels
that are invertible functions of $\langle x, x' \rangle$ include polynomial kernels,

$$k(x, x') = (\langle x, x' \rangle + c)^d, \quad \text{where } c \ge 0,\ d \text{ odd}, \tag{18.6}$$

and sigmoid kernels,

$$k(x, x') = \tanh(\kappa \langle x, x' \rangle + \Theta), \quad \text{where } \kappa, \Theta \in \mathbb{R}. \tag{18.7}$$

A similar result holds for RBF kernels (using the polarization identity): we only
require that the kernel allow the reconstruction of $\langle x, x' \rangle$ from k, evaluated on
certain input points, which we are allowed to choose (for details, cf. [467]).
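
The reconstruction formula of Proposition 18.1 can be tried out on a polynomial kernel with odd degree. The sketch below assumes the kernel $(\langle x, x' \rangle + c)^d$ with c = 1 and d = 3, uses the canonical basis of input space, and picks a one-term expansion so that an exact pre-image is guaranteed to exist; all function names are hypothetical. If no pre-image exists, the formula still returns a vector, but that vector need not map to the expansion.

```python
import math


def poly_kernel(x, y, c=1.0, d=3):
    """Polynomial kernel (<x, y> + c)^d, assumed here with odd d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d


def poly_kernel_inverse(s, c=1.0, d=3):
    """Inverse of t -> (t + c)^d for odd d: real odd root, then shift by c."""
    root = math.copysign(abs(s) ** (1.0 / d), s)
    return root - c


def exact_pre_image(alphas, xs, c=1.0, d=3):
    """Apply the reconstruction formula of Proposition 18.1 with the canonical basis."""
    n = len(xs[0])
    z = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        # Sum_j alpha_j k(x_j, e_i) equals k(z, e_i) whenever Phi(z) = Psi.
        s = sum(a * poly_kernel(x, e_i, c, d) for a, x in zip(alphas, xs))
        z.append(poly_kernel_inverse(s, c, d))
    return z


# One-term expansion Psi = Phi(x0): the pre-image exists and should be x0 itself.
x0 = [1.0, -2.0, 0.5]
z = exact_pre_image([1.0], [x0])
```

Up to floating point error, `z` reproduces `x0`, including the negative coordinate, which is why the odd root is taken with its sign.
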

The crucial assumption of the Proposition is the existence of the pre-image. 
Unfortunately, there are many situations for which there are no pre-images. To 
illustrate, we consider the feature map $\Phi$ in the form given by (2.21): $\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}$,
$x \mapsto k(\cdot, x)$. Clearly, only points in feature space that can be written as $k(\cdot, x)$
have a pre-image under this map. To characterize this set of points in a specific
example, consider the Gaussian kernels,

$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \tag{18.8}$$

In this case, $\Phi$ maps each input to a Gaussian centered on this point (see Figure 2.2).
We already know from Theorem 2.18, however, that no Gaussian can be
written as a linear combination of Gaussians centered at different points. There- 
fore, in the Gaussian case, none of the expansions (18.2), excluding trivial cases 
with only one term, has an exact pre-image. 


18.1.2 Approximate Pre-Images 


The problem we initially set out to solve has turned out to be insolvable in the 
general case. Consequently, rather than trying to find exact pre-images, we attempt 
to obtain approximate pre-images. We call $z \in \mathbb{R}^N$ an approximate pre-image of $\Psi$ if

$$\rho(z) = \|\Psi - \Phi(z)\|^2 \tag{18.9}$$

is small.¹

Are there vectors $\Psi$ for which good approximate pre-images exist? As we shall
see, this is indeed the case. As described in Chapter 14, for n = 1, 2, ..., p, Kernel
PCA provides projections

$$P_n \Phi(x) := \sum_{j=1}^n \langle \Phi(x), v^j \rangle v^j \tag{18.10}$$

with the following optimal approximation property (Proposition 14.1): Assume
that the $v^j$ are sorted according to nonincreasing eigenvalues $\lambda_j$, with $\lambda_p$ being the
smallest nonzero eigenvalue. Then $P_n$ is the n-dimensional projection minimizing

$$\sum_{i=1}^m \|P_n \Phi(x_i) - \Phi(x_i)\|^2. \tag{18.11}$$

Therefore, $P_n \Phi(x)$ can be expected to have a good approximate pre-image, provided
that x is drawn from the same distribution as the $x_i$; to give a trivial example,
x itself is already a good approximate pre-image. As we shall see in experiments,
however, even better pre-images can be found, which makes some interesting applications
possible [474, 365]:


Denoising. Given a noisy x, map it to $\Phi(x)$, discard components corresponding
to the eigenvalues $\lambda_{n+1}, \ldots, \lambda_m$ to obtain $P_n \Phi(x)$, and then compute a pre-image
z. The hope here is that the main structure in the data set is captured in the first
n directions in feature space, and the remaining components mainly pick up the
noise; in this sense, z can be thought of as a denoised version of x.


Compression. Given the Kernel PCA eigenvectors and a small number of features
$P_n \Phi(x)$ (cf. (18.10)) of $\Phi(x)$, but not x, compute a pre-image as an approximate
reconstruction of x. This is useful if n is smaller than the dimensionality of the
input data.


1. Just how small it needs to be in order to form a satisfactory approximation depends on 
the problem at hand. Therefore, we have refrained from giving a formal definition. 
Figure 18.2 Given a vector $\Psi \in \mathcal{H}$, we try to approximate it by a multiple of
a vector $\Phi(z)$ in the image of the input space ($\mathbb{R}^N$) under the nonlinear map $\Phi$,
by finding z such that the projection distance of $\Psi$ onto span($\Phi(z)$), depicted by
the straight line, is minimized.


Interpretation. Visualize a nonlinear feature extractor $v^j$ by computing a pre-image.


In the present chapter, we mainly focus on the first application. In the next section, 
we develop a method for minimizing (18.9), which we later apply to the case
where $\Psi = P_n \Phi(x)$.


18.2 Finding Approximate Pre-Images 
18.2.1 Minimizing the Projection Distance 


We start by considering a problem slightly more general than the pre-image problem.
We are given a kernel expansion with $N_x \in \mathbb{N}$ terms,

$$\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i), \tag{18.12}$$

and seek to approximate it by $\Psi' = \beta \Phi(z)$. For $\beta = 1$, this reduces to the pre-image
problem. Allowing the freedom that $\beta \ne 1$ makes sense, since the length of $\Psi$ is
usually not crucial: in SV classification, for instance, it can be rescaled (along with
the threshold b) without changing the decision function, cf. Chapter 7.

First observe that rather than minimizing

$$\|\Psi - \Psi'\|^2 = \sum_{i,j=1}^{N_x} \alpha_i \alpha_j k(x_i, x_j) + \beta^2 k(z, z) - 2 \sum_{i=1}^{N_x} \alpha_i \beta k(x_i, z), \tag{18.13}$$

we can minimize the distance between $\Psi$ and the orthogonal projection of $\Psi$ onto
span($\Phi(z)$) (Figure 18.2),

$$\left\| \frac{\langle \Psi, \Phi(z) \rangle}{\langle \Phi(z), \Phi(z) \rangle} \Phi(z) - \Psi \right\|^2 = \|\Psi\|^2 - \frac{\langle \Psi, \Phi(z) \rangle^2}{\langle \Phi(z), \Phi(z) \rangle}. \tag{18.14}$$

To this end, we maximize

$$\frac{\langle \Psi, \Phi(z) \rangle^2}{\langle \Phi(z), \Phi(z) \rangle}, \tag{18.15}$$
which can be expressed in terms of the kernel. The maximization of (18.15) over
$z$ is preferable to the minimization of (18.13) over $z$ and $\beta$, since the former
comprises a lower-dimensional problem, and since $z$ and $\beta$ have different scaling
behavior. Once the maximum of (18.15) is found, it is extended to the minimum of (18.13)
by setting (cf. (18.14)) $\beta = \langle \Psi, \Phi(z) \rangle / \langle \Phi(z), \Phi(z) \rangle$
(Problem 18.6). The function (18.15) can either be maximized using standard techniques
for unconstrained nonlinear optimization (as in [84]), or, for particular choices of
kernels, using fixed-point iteration methods, as shown below. Readers who are not
interested in the algorithmic details may want to skip Section 18.2.2, which shows that
for a certain class of kernels, the pre-image problem can be solved approximately using
an algorithm that resembles clustering methods.
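The relation between (18.13), (18.14), and (18.15) can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes a Gaussian kernel with unit width, and the helper names (`gauss`, `inner_psi_phi`, `objective`, `optimal_beta`, `sq_distance`) as well as the data are purely illustrative.

```python
import math

def gauss(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def inner_psi_phi(alphas, xs, z, k=gauss):
    """<Psi, Phi(z)> = sum_i alpha_i k(x_i, z) for Psi as in (18.12)."""
    return sum(a * k(x, z) for a, x in zip(alphas, xs))

def objective(alphas, xs, z, k=gauss):
    """Quantity (18.15): <Psi, Phi(z)>^2 / <Phi(z), Phi(z)>."""
    return inner_psi_phi(alphas, xs, z, k) ** 2 / k(z, z)

def optimal_beta(alphas, xs, z, k=gauss):
    """Scaling factor beta = <Psi, Phi(z)> / <Phi(z), Phi(z)>."""
    return inner_psi_phi(alphas, xs, z, k) / k(z, z)

def sq_distance(alphas, xs, z, beta, k=gauss):
    """Squared distance (18.13) between Psi and beta * Phi(z)."""
    psi_norm2 = sum(ai * aj * k(xi, xj)
                    for ai, xi in zip(alphas, xs)
                    for aj, xj in zip(alphas, xs))
    cross = inner_psi_phi(alphas, xs, z, k)
    return psi_norm2 + beta ** 2 * k(z, z) - 2.0 * beta * cross

# Illustrative data (assumed, not taken from the text).
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, 1.0, -0.3]
z = (0.6, 0.1)

beta = optimal_beta(alphas, xs, z)
d_opt = sq_distance(alphas, xs, z, beta)
```

For a fixed $z$, `d_opt` should agree with $\|\Psi\|^2$ minus the value of (18.15), which is exactly the statement of (18.14); any other choice of $\beta$ can only increase the distance.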


18.2.2 Fixed Point Iteration Approach for RBF Kernels 


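For kernels that satisfy $k(z, z) = 1$ for all $z \in \mathcal{X}$ (such as Gaussian
kernels and other normalized RBF kernels), (18.15) reduces to

$$\langle \Psi, \Phi(z) \rangle^2. \tag{18.16}$$

Below, we assume that $\mathcal{X} \subset \mathbb{R}^N$. For the extremum, we have

$$0 = \nabla_z \langle \Psi, \Phi(z) \rangle^2 = 2 \langle \Psi, \Phi(z) \rangle \nabla_z \langle \Psi, \Phi(z) \rangle. \tag{18.17}$$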


To evaluate the gradient in terms of $k$, we substitute (18.12) to get

$$0 = \sum_{i=1}^{N_x} \alpha_i \nabla_z k(x_i, z), \tag{18.18}$$

which is sufficient for (18.17) to hold.
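For $k(x_i, z) = k(\|x_i - z\|^2)$ (Gaussians, for instance), we obtain

$$0 = \sum_{i=1}^{N_x} \alpha_i k'(\|x_i - z\|^2)(x_i - z), \tag{18.19}$$

$k'$ being the derivative of $k$, leading to

$$z = \frac{\sum_{i=1}^{N_x} \alpha_i k'(\|x_i - z\|^2)\, x_i}{\sum_{i=1}^{N_x} \alpha_i k'(\|x_i - z\|^2)}. \tag{18.20}$$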


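For the Gaussian kernel $k(x_i, z) = \exp(-\|x_i - z\|^2/(2\sigma^2))$, we get

$$z = \frac{\sum_{i=1}^{N_x} \alpha_i \exp(-\|x_i - z\|^2/(2\sigma^2))\, x_i}{\sum_{i=1}^{N_x} \alpha_i \exp(-\|x_i - z\|^2/(2\sigma^2))} \tag{18.21}$$

(note that for $k(t) = \exp(at)$, we have $k'(t) = a \exp(at)$), and devise an iteration

$$z_{n+1} = \frac{\sum_{i=1}^{N_x} \alpha_i \exp(-\|x_i - z_n\|^2/(2\sigma^2))\, x_i}{\sum_{i=1}^{N_x} \alpha_i \exp(-\|x_i - z_n\|^2/(2\sigma^2))}. \tag{18.22}$$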
Connection to Clustering


The denominator equals $\langle \Psi, \Phi(z_n) \rangle$, and is thus nonzero in the
neighborhood of the extremum of (18.16), unless the extremum itself is zero. The latter
only occurs if the projection of $\Psi$ on the linear span of $\Phi(\mathbb{R}^N)$ is
zero, in which case it is pointless to try to approximate $\Psi$. Numerical instabilities
related to $\langle \Psi, \Phi(z) \rangle$ being small can thus be approached by
restarting the iteration with different starting values.
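The iteration (18.22) takes only a few lines of code. The sketch below is one possible way to write it, under assumptions of our own: the function name `fixed_point_preimage`, the stopping tolerance, the threshold `eps` on the denominator, and the toy data are illustrative and do not stem from the experiments reported in this chapter. When the denominator $\langle \Psi, \Phi(z_n) \rangle$ becomes too small, the function returns `None`, so that the caller can restart from a different starting value.

```python
import math

def gauss(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def fixed_point_preimage(alphas, xs, z0, sigma, n_iter=100, tol=1e-10, eps=1e-12):
    """Iterate (18.22) starting from z0.

    Returns the final z as a list, or None if the denominator
    <Psi, Phi(z_n)> drops below eps in absolute value.
    """
    z = list(z0)
    for _ in range(n_iter):
        # Weights alpha_i * k(x_i, z_n); their sum is <Psi, Phi(z_n)>.
        w = [a * gauss(x, z, sigma) for a, x in zip(alphas, xs)]
        denom = sum(w)
        if abs(denom) < eps:
            return None  # restart with another z0
        z_new = [sum(wi * x[d] for wi, x in zip(w, xs)) / denom
                 for d in range(len(z))]
        shift = max(abs(a - b) for a, b in zip(z_new, z))
        z = z_new
        if shift < tol:
            break
    return z

# Illustrative expansion: three positive terms and one negative term far away.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
alphas = [1.0, 1.0, 1.0, -0.5]
sigma = 1.0

z_star = fixed_point_preimage(alphas, xs, (0.3, 0.3), sigma)
```

In this toy setting, the returned point lies between the three patterns with positive coefficients and is pushed slightly away from the pattern with the negative coefficient.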

Interestingly, (18.22) can be interpreted in the context of clustering (e.g., [82]).
Iteration of this expression determines the center of a single Gaussian cluster,
trying to capture as many of the $x_i$ with positive $\alpha_i$ as possible, while
simultaneously avoiding those $x_i$ with negative $\alpha_i$. For SV classifiers, the
sign of the $\alpha_i$ equals the label of the pattern $x_i$. It is this sign which
distinguishes (18.22) from plain clustering or parametric density estimation. The
occurrence of negative signs is related to the fact that we are not trying to estimate
a parametric density but the difference between two densities (neglecting
normalization constants).

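To see this, we define the sets $\mathrm{pos} = \{i : \alpha_i > 0\}$ and
$\mathrm{neg} = \{i : \alpha_i < 0\}$, and the shorthands

$$p_{\mathrm{pos}}(z) = \sum_{\mathrm{pos}} \alpha_i \exp(-\|x_i - z\|^2/(2\sigma^2)) \tag{18.23}$$

and

$$p_{\mathrm{neg}}(z) = \sum_{\mathrm{neg}} |\alpha_i| \exp(-\|x_i - z\|^2/(2\sigma^2)). \tag{18.24}$$

The target (18.16) then reads $(p_{\mathrm{pos}}(z) - p_{\mathrm{neg}}(z))^2$; in other
words, we are trying to find a point $z$ for which the difference between the
(unnormalized) "probabilities" of the two classes is maximized, and to estimate the
approximation of (18.12) by a Gaussian centered at $z$. Furthermore, note that we can
rewrite (18.21) as

$$z = \frac{p_{\mathrm{pos}}(z)}{p_{\mathrm{pos}}(z) - p_{\mathrm{neg}}(z)}\, x_{\mathrm{pos}} + \frac{p_{\mathrm{neg}}(z)}{p_{\mathrm{neg}}(z) - p_{\mathrm{pos}}(z)}\, x_{\mathrm{neg}}, \tag{18.25}$$

where

$$x_{\mathrm{pos/neg}} = \frac{\sum_{\mathrm{pos/neg}} \alpha_i \exp(-\|x_i - z\|^2/(2\sigma^2))\, x_i}{\sum_{\mathrm{pos/neg}} \alpha_i \exp(-\|x_i - z\|^2/(2\sigma^2))}. \tag{18.26}$$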
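As a sanity check on the rewriting (18.25), the right-hand sides of (18.21) and (18.25) can be compared numerically at an arbitrary point $z$. The snippet below is a small sketch with made-up one-dimensional data; the helper names `p` and `centre` are illustrative assumptions.

```python
import math

def gauss(x, z, sigma):
    """One-dimensional Gaussian kernel exp(-(x - z)^2 / (2 sigma^2))."""
    return math.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

xs = [-1.0, 0.0, 0.5, 2.0]        # illustrative patterns
alphas = [0.8, 1.2, -0.4, -0.1]   # mixed signs, as for an SV classifier
sigma, z = 1.0, 0.2

pos = [i for i, a in enumerate(alphas) if a > 0]
neg = [i for i, a in enumerate(alphas) if a < 0]

def p(idx):
    """p_pos(z) or p_neg(z), cf. (18.23) and (18.24)."""
    return sum(abs(alphas[i]) * gauss(xs[i], z, sigma) for i in idx)

def centre(idx):
    """x_pos or x_neg, cf. (18.26); the sign of alpha_i cancels in the ratio."""
    return sum(abs(alphas[i]) * gauss(xs[i], z, sigma) * xs[i] for i in idx) / p(idx)

p_pos, p_neg = p(pos), p(neg)

# Right-hand side of (18.21).
rhs_18_21 = (sum(a * gauss(x, z, sigma) * x for a, x in zip(alphas, xs))
             / sum(a * gauss(x, z, sigma) for a, x in zip(alphas, xs)))

# Right-hand side of (18.25).
rhs_18_25 = (p_pos / (p_pos - p_neg)) * centre(pos) \
    + (p_neg / (p_neg - p_pos)) * centre(neg)
```

Both expressions coincide up to rounding, and $(p_{\mathrm{pos}} - p_{\mathrm{neg}})^2$ equals the target (18.16) evaluated at the same $z$.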


18.2.3 Toy Examples 


Let us look at some experiments, for which we used an artificial data set generated 
from three Gaussians (standard deviation $\sigma = 0.1$). Figure 18.3 shows the results of
performing kernel PCA on this data. Using the resulting eigenvectors, nonlinear 
principal components were extracted from a set of test points generated from the 
same model, and the points were reconstructed from varying numbers of principal 
components. Figure 18.4 shows that discarding higher-order components leads to 
removal of the noise — the points move towards their respective sources. 

To obtain further intuitive understanding in a low-dimensional case, Figure 18.6 
depicts the results of denoising a half circle and a square in the plane, using Kernel 
Figure 18.3 Kernel PCA toy example with a Gaussian kernel (see text). We plot lines
of constant feature value for the first 8 nonlinear principal components extracted with
$k(x, x') = \exp(-\|x - x'\|^2/0.1)$. The first 2 principal components (top
middle/right) separate the three clusters, components 3-5 split the clusters, and
components 6-8 split them again, orthogonal to the above splits [474].


Figure 18.4 Kernel PCA denoising by reconstruction from the projections onto the
eigenvectors of Figure 18.3. We generated 20 new points from each Gaussian, represented
them in feature space by their first $n = 1, 2, \ldots, 8$ nonlinear principal
components, and computed approximate pre-images, shown in the upper 9 pictures (top
left: original data, top middle: $n = 1$, top right: $n = 2$, etc.). Note that by
discarding higher order principal components (through using a small $n$), we removed the
noise inherent in the nonzero variance $\sigma^2$ of the Gaussians. The lower 9 pictures
show how the original points "moved" in the denoising. Unlike the corresponding case in
linear PCA, where we obtain lines (see Figure 18.5), clusters shrink to points in
Kernel PCA [474].


Figure 18.5 Reconstructions and point movements for linear PCA, based on the first
principal component [474].
Panels: Kernel PCA, Principal Curves, linear PCA

Figure 18.6 Denoising in 2-d (see text). The data set (small points) and its denoised version
(solid lines) are depicted. For linear PCA, we used one component for reconstruction,
since two components yield a perfect reconstruction, and thus do not denoise. Note that
all algorithms except for the KPCA pre-image approach have problems in capturing the
circular structure in the bottom example (from [365]).


PCA (with Gaussian kernel), a nonlinear autoencoder (see Section 14.2.3), prin- 
cipal curves (Example 17.6), and linear PCA. In all algorithms, parameter val- 
ues were selected such that the best possible denoising result was obtained. The 
figure shows that on the closed square problem, Kernel PCA does best (subjec- 
tively), followed by principal curves and the nonlinear autoencoder; linear PCA 
fails completely. Note however that all algorithms other than Kernel PCA provide 
an explicit one-dimensional parametrization of the data, whereas Kernel PCA only 
provides us with a means of mapping points to their denoised versions (in this 
case, we used four Kernel PCA features, and hence obtained a four-dimensional 
parametrization). 


18.2.4 Handwritten Digit Denoising 


The approach has also been tested on real-world data, using the USPS database of 
handwritten digits (Section A.1). For each of the ten digits, 300 training examples 
and 50 test examples were chosen at random. Results are shown in Figures 18.7 
and 18.8. In the experiments, linear and Kernel PCA (with Gaussian kernels) were 
performed on the original data. Two types of noise were added to the test patterns: 


(i) Additive Gaussian noise with zero mean and standard deviation $\sigma = 0.5$
(ii) 'Speckle' noise, where each pixel is flipped to black or white with probability
$p = 0.2$


For the noisy test sets, the projections onto the first n linear and nonlinear com- 
ponents were computed, and reconstruction was carried out in each case (using a 
basis expansion in the linear case, and the pre-image method in the kernel case). 
Panels: Gaussian noise, 'speckle' noise

Figure 18.7 Denoising of USPS data (see text). We first describe the left hand plot. Top: the
first occurrence of each digit in the test set. Second row: the digit above, but with added
Gaussian noise. Following five rows: the reconstruction achieved with linear PCA using
$n = 1, 4, 16, 64, 256$ components. Last five rows: the results of the Kernel PCA approach using
the same number of components. In the right hand plot, the same approaches are illustrated
for 'speckle' noise (from [365]).


When the optimal number of components was used in both linear and Kernel 
PCA, the Kernel PCA approach did significantly better. This can be interpreted as 
follows: Linear PCA can extract at most N components, where N is the dimension- 
ality of the data. Being a basis transform, all N components together fully describe 
the data. If the data are noisy, a certain fraction of the components are devoted to 
the extraction of noise. Kernel PCA, on the other hand, allows the extraction of 
up to m features, where m is the number of training examples. Accordingly, Ker- 
nel PCA can provide a larger number of features carrying information about the 
structure in the data (in our experiments, m > N). In addition, if the structure to 
be extracted is nonlinear, then linear PCA must necessarily fail, as demonstrated 
in the toy examples. 


18.3 Reduced Set Methods 


18.3.1 The Problem 


In the MNIST benchmark data set of 60000 handwritten digits, SVMs have 
achieved record accuracies (Chapter 11); they are inferior to neural nets in run- 
time classification speed, however [87]. In applications for which the latter is an 
issue, it is thus desirable to come up with methods to increase the speed by making
the SV expansion more sparse (cf. Chapter 10). This constitutes a problem slightly
more general than the pre-image problem studied above: we no longer just look 
for single pre-images, but for expansions in terms of several input vectors [84]. It 
turns out that we can build on the methods developed in the previous section to 
design an algorithm for this more general case (see Section 18.5). 

Assume we are given a vector $\Psi \in \mathcal{H}$, expanded in terms of the images of
input patterns $x_i \in \mathcal{X}$,

$$\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i), \tag{18.27}$$

with expansion coefficients $\alpha_i \in \mathbb{R}$. Rather than looking for a single
pre-image, we try to approximate it by a reduced set expansion [84],

$$\Psi' = \sum_{i=1}^{N_z} \beta_i \Phi(z_i), \tag{18.28}$$

with $N_z < N_x$, $\beta_i \in \mathbb{R}$ and points $z_i \in \mathcal{X}$. To this end,
it was suggested in [84] that

$$\|\Psi - \Psi'\|^2 = \sum_{i,j=1}^{N_x} \alpha_i \alpha_j k(x_i, x_j) + \sum_{i,j=1}^{N_z} \beta_i \beta_j k(z_i, z_j) - 2 \sum_{i=1}^{N_x} \sum_{j=1}^{N_z} \alpha_i \beta_j k(x_i, z_j) \tag{18.29}$$

be minimized. The crucial point is that even if $\Phi$ is not given explicitly, (18.29) can
be computed (and minimized) in terms of the kernel.


18.3.2 Finding the Coefficients 


Evidently, the RS problem consists of two parts: the determination of the RS vectors
$z_i$, and the computation of the expansion coefficients $\beta_i$. We start with the
latter, as it is both easier and common to different RS methods.


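Proposition 18.2 (Optimal Expansion Coefficients [474]) Suppose that the vectors
$\Phi(z_1), \ldots, \Phi(z_m)$ are linearly independent. The expansion coefficients
$\beta = (\beta_1, \ldots, \beta_m)^\top$ minimizing

$$\left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{i=1}^{m} \beta_i \Phi(z_i) \right\| \tag{18.30}$$

are given by

$$\beta = (K^z)^{-1} K^{zx} \alpha, \tag{18.31}$$

where $K^z_{ij} := \langle \Phi(z_i), \Phi(z_j) \rangle$ and
$K^{zx}_{ij} := \langle \Phi(z_i), \Phi(x_j) \rangle$.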

Note that if the $\Phi(z_i)$ are linearly independent, as they should be if we want to use
them for approximation, then $K^z$ has full rank. Otherwise, we can use the pseudo-
inverse, or select the solution that has the largest number of zero components.


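Proof We evaluate the derivative of the distance in $\mathcal{H}$,

$$\frac{\partial}{\partial \beta_j} \left\| \Psi - \sum_{i=1}^{m} \beta_i \Phi(z_i) \right\|^2 = -2 \left\langle \Phi(z_j),\ \Psi - \sum_{i=1}^{m} \beta_i \Phi(z_i) \right\rangle, \tag{18.32}$$

and set it to 0. Substituting $\Psi = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$, we obtain (using
$\alpha = (\alpha_1, \ldots, \alpha_m)^\top$)

$$K^{zx} \alpha = K^z \beta, \tag{18.33}$$

hence

$$\beta = (K^z)^{-1} K^{zx} \alpha. \tag{18.34}$$
∎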


No RS algorithm using the feature space norm as an optimality criterion can
improve on this result (cf. also Section 10.2.1). For instance, suppose we are given
an algorithm that computes the $\beta_i$ and $z_i$ simultaneously. Proposition 18.2 can
then be used to recompute the optimal coefficients $\beta$, which must yield a solution
at least as good. Algorithms may still be differentiated, however, by the way in
which they determine the vectors $z_i$ in the first place. In the next section, we
describe algorithms that simply select subsets of the $x_i$, whereas methods detailed
in Section 18.5 use vectors that can be different from the original $x_i$.
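Proposition 18.2 amounts to solving one linear system in the kernel matrices. The following sketch illustrates this under assumptions of our own: a Gaussian kernel, a small hand-written Gaussian elimination routine `solve` (any linear solver would do), and invented data. The names `optimal_coefficients` and `sq_error` are illustrative.

```python
import math

def gauss(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def optimal_coefficients(alphas, xs, zs, k=gauss):
    """beta = (K^z)^{-1} K^{zx} alpha, cf. (18.31)."""
    Kz = [[k(zi, zj) for zj in zs] for zi in zs]
    Kzx_alpha = [sum(k(zi, xj) * aj for xj, aj in zip(xs, alphas)) for zi in zs]
    return solve(Kz, Kzx_alpha)

def sq_error(alphas, xs, betas, zs, k=gauss):
    """Squared feature space distance (18.29), computed via the kernel."""
    pts = list(xs) + list(zs)
    cs = list(alphas) + [-b for b in betas]
    return sum(ci * cj * k(pi, pj)
               for ci, pi in zip(cs, pts) for cj, pj in zip(cs, pts))

# Illustrative data: four expansion patterns, two reduced set vectors.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alphas = [1.0, -0.5, 0.7, 0.2]
zs = [(0.2, 0.4), (0.9, 0.3)]

betas = optimal_coefficients(alphas, xs, zs)
err = sq_error(alphas, xs, betas, zs)
```

Perturbing any component of `betas` can only increase `err`. If the $z_i$ are chosen to be the $x_i$ themselves, the routine recovers the original coefficients, provided the kernel matrix is invertible.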


18.4 Reduced Set Selection Methods 


Why should we expect to gain anything by selecting a subset of the expansion 
vectors in order to get a sparser expansion? Indeed, doesn’t the SVM algorithm 
already find the sparsest solution? Unfortunately, this is not the case. Since the 
coefficients in an SVM expansion satisfy $\alpha_i \in [-C, C]$ (cf. Chapter 7) for some
positive value of the regularization constant $C$, there is reason to believe that the
SV expansion can be made sparser by removing this constraint on $\alpha_i$ [400].²


18.4.1 RS Selection via Kernel PCA 


The idea for the first algorithm we describe arises from our observation that the
null space of the Gram matrix $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$ tells us
precisely how many vectors can be removed from an expansion while incurring zero
approximation error (assuming we correctly adjust the coefficients). In other words, it
tells us how sparse we can make a kernel expansion without changing it in the least
[467, 184].³ Interestingly, it turns out that this problem is closely related to Kernel
PCA (Chapter 14).

Let us start with the simplest case. Assume there exists an eigenvector $\alpha \neq 0$ of
$K$ with eigenvalue 0, hence $K\alpha = 0$. Using $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$,
this reads

$$\sum_{j=1}^{m} \langle \Phi(x_i), \Phi(x_j) \rangle\, \alpha_j = 0 \quad \text{for all } i = 1, \ldots, m, \tag{18.35}$$

hence

$$\sum_{j=1}^{m} \alpha_j \Phi(x_j) = 0. \tag{18.36}$$

Since $\alpha \neq 0$, the $\Phi(x_j)$ are linearly dependent, and therefore any of the
$\Phi(x_j)$ with nonzero $\alpha_j$ can be expressed in terms of the others. Hence we may
use the eigenvectors with eigenvalue 0 to eliminate certain terms from any expansion in
the $\Phi(x_j)$.
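A tiny example makes the zero-eigenvalue case concrete. In the sketch below, which uses assumptions of our own, the kernel is linear (so $\Phi$ is the identity on $\mathbb{R}^2$), and the three patterns are chosen such that $\Phi(x_1) + \Phi(x_2) - \Phi(x_3) = 0$. The names `null_vec` and `eliminate` are illustrative.

```python
def lin(x, y):
    """Linear kernel; here Phi is simply the identity map."""
    return sum(a * b for a, b in zip(x, y))

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = [[lin(xi, xj) for xj in xs] for xi in xs]

# Eigenvector of K with eigenvalue 0: Phi(x_1) + Phi(x_2) - Phi(x_3) = 0.
null_vec = [1.0, 1.0, -1.0]
K_null = [sum(K[i][j] * null_vec[j] for j in range(3)) for i in range(3)]

def eliminate(coeffs, null_vec, n):
    """Remove term n from sum_j coeffs[j] Phi(x_j) without changing the sum.

    Uses sum_j null_vec[j] Phi(x_j) = 0, which requires null_vec[n] != 0.
    """
    return [0.0 if j == n else coeffs[j] - coeffs[n] * null_vec[j] / null_vec[n]
            for j in range(len(coeffs))]

def evaluate(coeffs, x):
    """Kernel expansion sum_j coeffs[j] k(x_j, x)."""
    return sum(c * lin(xj, x) for c, xj in zip(coeffs, xs))

coeffs = [0.5, -2.0, 3.0]
reduced = eliminate(coeffs, null_vec, n=2)
```

The reduced expansion uses only two patterns, yet it represents exactly the same function as the original one.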

What happens if we do not have vanishing eigenvalues, as in the case of Gaussian
kernels (Theorem 2.18)? Intuitively, we anticipate that even though the above
reasoning no longer holds precisely, it should give a good approximation. The crucial
difference, however, is that in order to get the best possible approximation, we need to
take into account the coefficients of the expansion $\Psi = \sum_j \alpha_j \Phi(x_j)$:
if we incur an error by removing $\alpha_n \Phi(x_n)$, for example, then this error also
depends on $\alpha_n$. How do we then select the optimal $n$?

Clearly, we would like to find coefficients $\beta_j$ that minimize the error we commit
by replacing $\alpha_n \Phi(x_n)$ with $\sum_{j \neq n} \beta_j \Phi(x_j)$;

$$\rho(\beta, n) = \left\| \alpha_n \Phi(x_n) - \sum_{j \neq n} \beta_j \Phi(x_j) \right\|^2. \tag{18.37}$$


To establish a connection to Kernel PCA, we make a change of variables. First, 


2. For instance, a certain pattern might appear twice in the training set, yet the SV expan- 
sion must utilize both copies since the upper bound constraint limits the coefficient of each 
to C. 

3. The Gram matrix can either be computed using only those examples that have a nonzero
$\alpha_j$, or from a larger set containing further candidate expansion vectors.
define $\eta_j = 1$ for $j = n$ and $\eta_j = -\beta_j/\alpha_n$ for $j \neq n$. Hence the
error (18.37) equals $\alpha_n^2 \|\sum_j \eta_j \Phi(x_j)\|^2$. Normalizing $\eta$ to
obtain $\gamma := \eta/\|\eta\|$, and thus $\gamma_n = 1/\|\eta\|$, leads to the problem of
minimizing

$$\rho(\gamma, n) = \frac{\alpha_n^2}{\gamma_n^2} \langle \gamma, K\gamma \rangle \tag{18.38}$$

with respect to $\gamma$, where $\|\gamma\| = 1$ (note that $\rho$ is invariant when
$\gamma$ is rescaled).


A straightforward calculation shows that we can recover the approximation
coefficients for $\alpha_n \Phi(x_n)$ (namely, the values that are added to the
$\alpha_j$ ($j \neq n$) when $\alpha_n \Phi(x_n)$ is left out): these are
$\beta_j = -\alpha_n \gamma_j / \gamma_n$, $j \neq n$.

Rather than minimizing the nonlinear function (18.38), we now devise a
computationally attractive approximate solution. This approximation is motivated by
the observation that $\langle \gamma, K\gamma \rangle$ alone is minimized for the
eigenvector with minimal eigenvalue, consistent with the special case discussed above
(cf. (18.36)). In this case, $\langle \gamma, K\gamma \rangle = \lambda_{\min}$. More
generally, if $\gamma^i$ is a normalized eigenvector of $K$ with eigenvalue $\lambda_i$, then

$$\rho(i, n) = \frac{\alpha_n^2}{(\gamma_n^i)^2}\, \lambda_i. \tag{18.39}$$


This can be minimized in $O(m^3)$ operations by performing kernel PCA and scanning
through the matrix $(\rho(i, n))_{in}$. The complexity can be reduced to $O(m' m^2)$ by
only considering the smallest $m'$ eigenvalues, with $m' < m$ chosen a priori. Hence,
we can eliminate $\Phi(x_n)$, where $n$ is chosen in a principled yet efficient way.

Setting all computational considerations aside, the optimal greedy solution to the
above selection problem, equivalent to (18.38), can also be obtained using
Proposition 18.2: we compute the optimal solution for all possible patterns that could be
left out (that is, we use subsets of $\{x_1, \ldots, x_m\}$ of size $m - 1$ as
$\{z_1, \ldots, z_{m-1}\}$) and evaluate the distance (18.30) in each case.

The same applies to subsets of any size. If we have the resources to exhaustively 
scan through all subsets of size $m'$ ($1 < m' < m - 1$), then Proposition 18.2 provides
the optimal way of selecting the best expansion of a given size. Better expansions 
can only be obtained if we drop the restriction that the approximation be written 
in terms of the original patterns, as done in Section 18.5. 
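For a handful of patterns, the exhaustive leave-one-out variant of this greedy step can be written down directly. The sketch below makes several assumptions that are not part of the text: a Gaussian kernel, a hand-written linear solver, and a toy set in which two patterns nearly coincide. The names `drop_one` and `best_removal` are illustrative.

```python
import math

def gauss(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def drop_one(alphas, xs, n, k=gauss):
    """Optimal coefficients (Proposition 18.2) and squared error without x_n."""
    zs = xs[:n] + xs[n + 1:]
    Kz = [[k(zi, zj) for zj in zs] for zi in zs]
    rhs = [sum(k(zi, xj) * aj for xj, aj in zip(xs, alphas)) for zi in zs]
    betas = solve(Kz, rhs)
    pts, cs = xs + zs, alphas + [-b for b in betas]
    err = sum(ci * cj * k(pi, pj)
              for ci, pi in zip(cs, pts) for cj, pj in zip(cs, pts))
    return betas, err

def best_removal(alphas, xs, k=gauss):
    """Index n whose removal gives the smallest approximation error."""
    return min(range(len(xs)), key=lambda n: drop_one(alphas, xs, n, k)[1])

# Illustrative data: the first two patterns are nearly identical.
xs = [(0.0, 0.0), (0.01, 0.0), (2.0, 0.0), (0.0, 2.0)]
alphas = [1.0, 1.0, 1.0, 1.0]

n_best = best_removal(alphas, xs)
```

As expected, the procedure proposes to drop one of the two near-duplicates, since the other can take over its contribution almost without error.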

No matter how we end up choosing $n$, we approximate $\Psi$ by

$$\Psi = \sum_{j \neq n} \alpha_j \Phi(x_j) + \alpha_n \Phi(x_n) \approx \sum_{j \neq n} \left( \alpha_j - \alpha_n \frac{\gamma_j}{\gamma_n} \right) \Phi(x_j). \tag{18.40}$$

The whole scheme can be iterated until the expansion of $\Psi$ is sufficiently sparse.
If we want to avoid having to find the smallest eigenvalue of $K$ anew at each step,
then approximate schemes using heuristics can be conceived.

We now describe experiments conducted to demonstrate this set reduction
method. We determined the eigenvectors at each step using the Gram matrix
computed from the SVs, and $n$ was selected according to (18.39). For the USPS
handwritten digit database, approximations were found to the SV expansions (7.25) of
Table 18.1 Number of test errors for each binary recognizer and test error rates for 10-class
classification, using the RS method that selects patterns via Kernel PCA (Section 18.4.1).
Top: number of SVs in the original SV RBF-system. Bottom, first row: original SV system,
with 254 SVs on average; following rows: systems with varying average numbers of RS
patterns. In the system RSS-n, equal fractions of SVs were removed from each recognizer
such that on average, n RS patterns were left.

[Table body not recoverable in this copy. Columns: digit 0-9 and average; rows: #SVs,
original SV system, RSS-50, RSS-75, RSS-100, RSS-150, RSS-200, RSS-250.]

ten binary classifiers, each trained to separate one digit from the rest. A Gaussian 
kernel $k(x, x') = \exp(-\|x - x'\|^2/(0.5 \cdot 16^2))$ was used. The original SV system has
on average 254 SVs per classifier. Table 18.1 shows the classification error results 
for varying numbers of RS patterns (RSS-n means that for each binary classifier, 
SVs were removed until on average n were left). On each line, the number of mis- 
classified digits for each single classifier is shown, as is the error of the combined 
10-class machine. The optimal SV threshold b was re-computed on the training set 
after the RS approximation was found. 


18.4.2 RS Selection via $\ell_1$ Penalization


We next consider a method for enforcing sparseness inspired by $\ell_1$ shrinkage
penalizers (cf. Chapter 3), following the discussion in [474]. In a way, this allows
us to benefit from the effect of $\ell_1$ penalizers even if we do not want to use an
$\ell_1$ term as a regularizer, as was the case in LP-machines (Section 7.7).

Given an expansion $\sum_i \alpha_i \Phi(x_i)$, we approximate it by
$\sum_i \beta_i \Phi(x_i)$ through the minimization of

$$\left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{i=1}^{m} \beta_i \Phi(x_i) \right\|^2 + \lambda \sum_{i=1}^{m} c_i |\beta_i| \tag{18.41}$$

over all $\beta_i$. Here, $\lambda > 0$ is a constant determining the trade-off between
sparseness and the quality of approximation. The constants $c_i$ can for instance be set
to 1 or $\bar{\alpha}/|\alpha_i|$ (where $\bar{\alpha}$ is the mean of all $|\alpha_i|$).
In the latter case, we hope for a sparser decomposition, since more emphasis is put on
shrinking terms that are already small. This reflects the intuition that it is less
promising to try to shrink very large terms. Ideally, we would like to count the number
of nonzero coefficients, rather than sum their moduli; the former approach does not lead
to an efficiently solvable optimization problem, however.

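To dispose of the modulus, we rewrite $\beta_i$ as

$$\beta_i := \beta_i^+ - \beta_i^-, \tag{18.42}$$

where $\beta_i^\pm \geq 0$. In terms of the new variables, we end up with the quadratic
programming problem

$$\underset{\beta^+,\, \beta^-}{\text{minimize}} \quad \sum_{i,j} (\beta_i^+ - \beta_i^-)(\beta_j^+ - \beta_j^-) K_{ij} + \sum_j \left[ \Big( c_j \lambda - 2 \sum_i K_{ij} \alpha_i \Big) \beta_j^+ + \Big( c_j \lambda + 2 \sum_i K_{ij} \alpha_i \Big) \beta_j^- \right] \tag{18.43}$$

$$\text{subject to} \quad \beta_j^+,\ \beta_j^- \geq 0. \tag{18.44}$$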

This problem can be solved using standard quadratic programming tools. The 
solution (18.42) could be used directly as expansion coefficients. For optimal pre- 
cision, however, we merely use it to select which patterns to use in the expansion 
(those with nonzero coefficients), and re-compute the optimal coefficients accord- 
ing to Proposition 18.2. 
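To get a feeling for the effect of the penalty, one can minimize (18.41) on a toy problem. The sketch below does not use a quadratic programming tool; as a stand-in of our own, it applies coordinate descent with soft-thresholding directly to (18.41), which minimizes the same convex objective. The Gaussian kernel, the data, the value of $\lambda$, and the names `soft`, `objective`, and `l1_reduce` are all illustrative assumptions.

```python
import math

def gauss(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def soft(r, t):
    """Soft-thresholding: sign(r) * max(|r| - t, 0)."""
    return math.copysign(max(abs(r) - t, 0.0), r)

def objective(K, alphas, betas, lam, c):
    """Value of (18.41): (alpha - beta)^T K (alpha - beta) + lam * sum_i c_i |beta_i|."""
    d = [a - b for a, b in zip(alphas, betas)]
    m = len(d)
    quad = sum(d[i] * K[i][j] * d[j] for i in range(m) for j in range(m))
    return quad + lam * sum(ci * abs(b) for ci, b in zip(c, betas))

def l1_reduce(K, alphas, lam, c, sweeps=200):
    """Coordinate descent on (18.41), started from beta = alpha.

    Each update minimizes the objective exactly in one coordinate beta_j.
    """
    m = len(alphas)
    Ka = [sum(K[i][j] * alphas[j] for j in range(m)) for i in range(m)]
    betas = list(alphas)
    for _ in range(sweeps):
        for j in range(m):
            r = Ka[j] - sum(K[j][i] * betas[i] for i in range(m) if i != j)
            betas[j] = soft(r, lam * c[j] / 2.0) / K[j][j]
    return betas

# Illustrative expansion with some small coefficients.
xs = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alphas = [1.0, 0.1, -0.8, 0.05, 0.6]
K = [[gauss(a, b) for b in xs] for a in xs]

abar = sum(abs(a) for a in alphas) / len(alphas)
c = [abar / abs(a) for a in alphas]   # c_i = mean(|alpha|) / |alpha_i|

betas = l1_reduce(K, alphas, lam=0.5, c=c)
selected = [i for i, b in enumerate(betas) if b != 0.0]
```

In line with the procedure described above, `selected` would then only be used to decide which patterns to keep; the final coefficients would be recomputed with Proposition 18.2.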

In many applications, we face the problem of simultaneously approximating a 
set of M feature space expansions. For instance, in digit classification, a common 
approach is to train M = 10 binary recognizers, one for each digit. To this end, 
the quadratic programming formulation (18.43) can be modified to find all M 
expansions simultaneously, encouraging the use of the same expansion patterns 
in more than one binary classifier [474]. 

These algorithms were evaluated on the same problem considered in the previ- 
ous section. As with the results in Table 18.1, Table 18.2 shows that the accuracy 
of the original system can be closely approached by selecting sufficiently many RS 
patterns. The removal of about 40% of the Support Vectors leaves the test error 
practically unchanged. 


18.4.3 RS Selection by Sparse Greedy Methods 


Another set of methods was recently proposed to select patterns from the training
set ([603, 514, 503], cf. also [580]). The basic idea, described in Section 10.2, is to start
with an empty expansion, and greedily select the patterns that lead to the smallest
error in approximating the remaining patterns. The resulting algorithms are very
computationally efficient and have led to fine results.

It bears mentioning, however, that in many cases, with Gaussian Process
regression being the exception (see Section 16.4 for more details), they do not take
into account the expansion coefficients $\alpha_i$ of the original feature space vector, and
Table 18.2 Number of test errors for each binary recognizer and test error rates for 10-
class classification, using the RS method employing $\ell_1$ penalization (Section 18.4.2)
(where $c_i = \bar{\alpha}/|\alpha_i|$). First row: original SV system, with 254 SVs on
average; following rows: systems with varying average numbers of RS patterns. In the
system RSS$_2$-n, $\lambda$ was adjusted such that the average number of RS patterns left
was n (the constant $\lambda$, given in parentheses, was chosen such that the numbers n
were comparable to Table 18.1). The results can be further improved using the multi-class
method cited in Section 18.4.2. For instance, using about 570 expansion patterns (which
is the same number that we get when taking the union of all SVs in the RSS$_2$-74 system)
led to an improved error rate of 5.5%.

[Table body not recoverable in this copy. Columns: digit 0-9 and 10-class result; rows:
original SV system, RSS$_2$-50 (3.34), RSS$_2$-74 (2.55), RSS$_2$-101 (1.73),
RSS$_2$-151 (0.62), RSS$_2$-200 (0.13), RSS$_2$-234 (0.02).]

thus do not always produce optimal results when the task is to approximate one 
given vector. Indeed, they do not even start from an original solution vector, as 
they compute everything in one pass. Therefore, they are not strictly speaking re- 
duced set post-processing methods; we might equally well consider them to be sparse 
training algorithms. These algorithms are dealt with in Section 10.2. 

On the other hand, if the effective dimensionality of the feature space is low, as 
is manifest in a rapid decay of the eigenvalues of the kernel matrix K, a sparse 
approximation of just about any function with small RKHS norm will be possible. 
In this case, which is rather frequent in practice, sparse approximation schemes 
which find a reduced set of expansion kernels a priori will perform well. See 
Section 10.2 for examples. 


18.4.4 The Primal Reformulation 


As discussed in Section 18.4, one of the reasons that SV expansions are not usually 
as sparse as they should be is the restriction of their coefficients to the interval 
[—C, C]. This problem derives directly from the structure of the quadratic program 
(QP) used to train an SVM. It is not the only problem caused by the QP, however. 

A more fundamental reason for the SVM solution not being the sparsest possible 
is that the set of SVs contains all the information necessary to solve the classification task, 
as discussed in Section 7.8.2. This is a rather severe constraint, and is not enforced 
in other kernel-based classifiers such as Relevance Vector Machines (Chapter 16). 
It is partly this constraint that prevents SVMs from returning sparser solutions in 
the first place. 

In an attempt to address this problem, Osuna and Girosi [400] proposed what 
they call the primal reformulation of the original SVC training problem (7.35). In 
their modification, they substitute the SV expansion $w = \sum_j \alpha_j y_j \Phi(x_j)$
into (7.35) to obtain:

$$\underset{\alpha,\, \xi \geq 0,\ b}{\text{minimize}} \quad \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \frac{C}{m} \sum_{i=1}^{m} \xi_i \tag{18.45}$$

$$\text{subject to} \quad y_i \left( \sum_{j=1}^{m} \alpha_j y_j k(x_j, x_i) + b \right) \geq 1 - \xi_i \quad \text{for } i = 1, \ldots, m. \tag{18.46}$$

This formulation uses positivity constraints on $\alpha_i$ and $\xi_i$; the inequalities in
(18.45) are to be read component-wise. We observe, however, that this formulation
no longer requires that $\alpha_i \leq C$.

Note, moreover, that the optimization runs over $\alpha$, $\xi$, and $b$, and that the
constraints (18.46) are linear in these variables; thus, we are still dealing with a QP. Its
structure is not as appealing as that of the original form from an implementation 
viewpoint, however. The more complicated constraints make it harder to come up 
with algorithms that can solve large problems. 

Although there is no a priori guarantee that this formulation will give sparser
expansions, the method was nonetheless successfully applied to several small real-
world problems [400]. The recommended starting point for the optimization is
$\alpha = 0$; other tricks for encouraging sparseness include the use of an $\ell_1$
penalty term as in Section 18.4.2.


18.4.5 RS Selection via SV Regression 


A second approach proposed by [400], which is particularly appealing in terms of
its simplicity, uses SV regression to find RS vectors for SV classifiers. The idea is
to apply $\varepsilon$-SV regression to the data set generated by evaluating the
real-valued argument of the decision function,

$$g(x) = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b, \tag{18.47}$$

on the SVs; in other words, to

$$\{(x_i, g(x_i)) \mid \alpha_i > 0,\ i = 1, \ldots, m\}. \tag{18.48}$$

If the SVR training uses a large value of $C$ (cf. Chapter 9), then (almost) all data
points should be approximated within the accuracy $\varepsilon$ set by the user. Therefore,
the solution of the SVR algorithm has the same form as $g$ (18.47) and can be used
as a drop-in replacement for $g$. Since the SVR solution does not typically use all
training examples (18.48), it is usually sparser than $g$. Note that when combined
with the $\nu$-SV regression algorithm (Section 9.3), this approach allows a rather
more direct control of the size of the RS expansion (cf. Proposition 9.2).
18.5 Reduced Set Construction Methods


So far, we have considered the problem of how to select a reduced set of expansion 
vectors from the original set. We now return to the problem posed initially, which 
includes the construction of new vectors to reach high set size reduction rates. 


18.5.1 Iterated Pre-Images 


Suppose we want to approximate a vector

$$\Psi_1 = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{18.49}$$

with an expansion of type (18.28) with $N_z > 1$ and $z_i \in \mathbb{R}^N$.⁴ To this end,
we iterate a procedure for finding approximate pre-images (in the case of Gaussian
kernels, for instance, this procedure is described in Section 18.2.2). This means that in
step $m'$, we need to find a pre-image $z_{m'}$ of

$$\Psi_{m'} = \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{i=1}^{m'-1} \beta_i \Phi(z_i). \tag{18.50}$$


The coefficients are updated after each step according to Proposition 18.2 (if the
discrepancy $\Psi_{m'+1}$ has not yet reached zero, then $K^z$ is invertible).

The iteration is stopped after $N_z$ steps, a number either specified in advance,
or obtained by checking if $\|\Psi_{N_z+1}\|$ (which equals
$\|\Psi_1 - \sum_{i=1}^{N_z} \beta_i \Phi(z_i)\|$) has fallen below a specified threshold.
The solution vector takes the form (18.28). A toy example, using the Gaussian kernel, is
shown in Figure 18.9.
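The procedure can be summarized in a short program. The following sketch combines the fixed-point iteration (18.22) with the coefficient update of Proposition 18.2. It rests on several assumptions of our own: a Gaussian kernel, a hand-written linear solver, and a simple starting heuristic (the pattern $x_i$ at which the current residual is largest in magnitude). The names `preimage`, `reduced_set`, and `sq_error`, as well as the one-dimensional data, are illustrative.

```python
import math

def gauss(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def preimage(coefs, pts, z0, sigma, n_iter=200, tol=1e-12):
    """Fixed-point iteration (18.22) for the expansion sum_i coefs[i] Phi(pts[i])."""
    z = list(z0)
    for _ in range(n_iter):
        w = [c * gauss(p, z, sigma) for c, p in zip(coefs, pts)]
        denom = sum(w)
        if abs(denom) < 1e-12:
            break  # a fuller implementation would restart from another z0
        z_new = [sum(wi * p[d] for wi, p in zip(w, pts)) / denom
                 for d in range(len(z))]
        done = max(abs(a - b) for a, b in zip(z_new, z)) < tol
        z = z_new
        if done:
            break
    return tuple(z)

def reduced_set(alphas, xs, n_z, sigma):
    """Construct n_z RS vectors by iterated pre-images, cf. (18.50)."""
    zs, betas = [], []
    for _ in range(n_z):
        # Residual Psi_{m'} written as a single expansion over xs and current zs.
        coefs = list(alphas) + [-b for b in betas]
        pts = list(xs) + zs

        def resid(p):
            return sum(c * gauss(q, p, sigma) for c, q in zip(coefs, pts))

        z0 = max(xs, key=lambda p: abs(resid(p)))  # starting heuristic (assumed)
        zs.append(preimage(coefs, pts, z0, sigma))
        # Recompute all coefficients by Proposition 18.2.
        Kz = [[gauss(a, b, sigma) for b in zs] for a in zs]
        rhs = [sum(al * gauss(z, x, sigma) for al, x in zip(alphas, xs)) for z in zs]
        betas = solve(Kz, rhs)
    return zs, betas

def sq_error(alphas, xs, betas, zs, sigma):
    """Squared feature space distance (18.29)."""
    pts = list(xs) + list(zs)
    cs = list(alphas) + [-b for b in betas]
    return sum(ci * cj * gauss(pi, pj, sigma)
               for ci, pi in zip(cs, pts) for cj, pj in zip(cs, pts))

# Illustrative expansion: two well separated bumps on the real line.
xs = [(-1.0,), (1.0,)]
alphas = [1.0, 1.0]
sigma = 0.5

zs1, b1 = reduced_set(alphas, xs, 1, sigma)
zs2, b2 = reduced_set(alphas, xs, 2, sigma)
err1 = sq_error(alphas, xs, b1, zs1, sigma)
err2 = sq_error(alphas, xs, b2, zs2, sigma)
```

In this toy case a single RS vector can only capture one of the two bumps, so `err1` stays large, whereas with two RS vectors the residual nearly vanishes.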

For other kernel types, such as the polynomial kernels, we need to use a different 
procedure for computing approximate pre-images. One way of dealing with the 
problem is to minimize (18.15) directly. To achieve this, we can use unconstrained 
nonlinear optimization techniques, as proposed by Burges [84]. 

Finally, note that in many cases, such as multiclass SVMs and multiple Kernel 
PCA feature extractors, we may actually want to approximate several vectors si- 
multaneously. This leads to rather more complex equations; cf. [474] for a discus- 
sion. 


18.5.2 Phase II: Simultaneous Optimization of RS Vectors 
Once all individual pre-images have been computed in this way, we could go 


ahead and apply the resulting RS expansion. It turns out, however, that it is still 
possible to improve things using a second phase, in which we simultaneously 


4. Note that reduced set selection methods can work on general inputs. The present re- 
duced set construction method, on the other hand, needs vectorial data, as it relies on form- 
ing linear combinations of patterns. 
Figure 18.9 Toy example of the reduced set construction approach obtained by iterating
the fixed-point algorithm of Section 18.2.2. The first image (top left) shows the SVM decision
boundary that we are trying to approximate. The following images show the approximations
using $N_z = 1, 2, 4, 9, 13$ RS vectors. Note that already the approximation with $N_z = 9$
(bottom middle) is almost identical to the original SVM, which has 31 expansion vectors (SVs)
(from [437]).


optimize over all $(z_i, \beta_i)$. Empirical observations have shown that this part of the
optimization is more computationally expensive than the first phase, by about
two orders of magnitude [84, 87]. In addition to this high computational cost, it
is numerically difficult to handle: the optimization needs to be restarted several
times to avoid getting trapped in local minima. At the end of phase II, it is
advisable to recompute the $\beta_i$ using Proposition 18.2. Let us now take a look at
some experiments to see how well these methods do in practice.


18.5.3 Experiments 


Table 18.3 shows that the RS construction method performs better than the RS 
selection methods described above (Tables 18.1 and 18.2). This is because it is able 
to utilize vectors different from the original support patterns in the expansion. 
To speed up the process by a factor of 10, we have to use a system with 25 
RS vectors (RSC-25). We observe in Table 18.3 that the test error 
only increases moderately as a result, from 4.4% to 5.1%, which is still competitive 
with convolutional neural networks on this database (Table 7.4). In addition, 
we can further improve the system by adding the second phase described in 
Section 18.5.2, in which a global gradient descent is performed in the space of all 


MNIST 
Experiments 


Table 18.3 Number of test errors for each binary recognizer and test error rates for 10- 
class classification, using the RS construction method of Section 18.5. First row: original SV 
system, with 254 SVs on average (see also Tables 18.1 and 18.2); following rows: systems 
with varying numbers of RS vectors (RSC-n stands for n vectors constructed) per binary 
recognizer, computed by iterating one-term RBF approximations (Section 18.5.1) separately 
for each recognizer. Last row: a subsequent global gradient descent (Section 18.5.2) further 
improves the results, as shown here for the RSC-25 system (see text). 


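[Table 18.3: numerical entries not recoverable. Rows, in order: original SV system, RSC-10, RSC-20, RSC-25, RSC-50, RSC-100, RSC-150, RSC-200, RSC-250, and RSC-25 followed by global gradient descent.]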

$(z_i, \beta_i)$; this leads to an error rate of 4.7%. For the kernel considered, this is almost 
identical to phases I + II of Burges’s RS method, which yield 5.0% (for polynomial 
kernels, the latter method leads to an error of 4.3% with the same increase in speed 
[84]). Finally, Figure 18.10 shows the RSC-20 vectors of the 10 binary classifiers. As 
an aside, note that unlike Burges’s method, which directly tackles the nonlinear 
optimization problem, the present algorithm produces images that look like digits. 

As in [87], good RS construction results were obtained even though the objective 
function did not decrease to zero (in the RS construction experiments, it was 
reduced by a factor of 2 to 20 in the first phase, depending on how many RS 
vectors were computed; the global gradient descent yielded another factor of 2 to 
3). We conjecture that this is due to the following: In classification, we are not 
interested in $\|\Psi - \Psi'\|$, but in 
$\int \left| \mathrm{sgn}\left(\sum_{i=1}^{N_x} \alpha_i k(x, x_i) + b\right) - \mathrm{sgn}\left(\sum_{i=1}^{N_z} \beta_i k(x, z_i) + b\right) \right| dP(x)$, 
where $P$ is the underlying probability distribution of the patterns (cf. 
[55]). This is consistent with the fact that the performance of an RS SV classifier 
can be improved by re-computing an optimal threshold $b$. 
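
To make the difference between the two criteria concrete, the following sketch computes both for a toy problem. It is an illustration only: the Gaussian kernel, the function names (`gauss_kernel`, `expansion`, `rs_distance_sq`, `sign_disagreement`), and the use of a finite sample in place of the integral over $P$ are assumptions made here, not part of the experiments reported above. The squared feature space distance is obtained purely from kernel evaluations, by expanding the squared norm.

```python
import math


def gauss_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def expansion(coeffs, points, x, k=gauss_kernel):
    """Evaluate the kernel expansion sum_i coeffs[i] * k(x, points[i])."""
    return sum(c * k(x, p) for c, p in zip(coeffs, points))


def rs_distance_sq(alpha, xs, beta, zs, k=gauss_kernel):
    """||Psi - Psi'||^2 for Psi = sum_i alpha_i Phi(x_i), Psi' = sum_j beta_j Phi(z_j)."""
    aa = sum(ai * expansion(alpha, xs, xi, k) for ai, xi in zip(alpha, xs))
    ab = sum(bj * expansion(alpha, xs, zj, k) for bj, zj in zip(beta, zs))
    bb = sum(bj * expansion(beta, zs, zj, k) for bj, zj in zip(beta, zs))
    return aa - 2.0 * ab + bb


def sign_disagreement(alpha, xs, beta, zs, b, sample, k=gauss_kernel):
    """Sample average of |sgn(f(x)) - sgn(g(x))|, with sgn(0) taken as +1."""
    def sgn(v):
        return 1 if v >= 0 else -1

    total = 0
    for x in sample:
        f = expansion(alpha, xs, x, k) + b   # original decision function
        g = expansion(beta, zs, x, k) + b    # reduced set decision function
        total += abs(sgn(f) - sgn(g))
    return total / len(sample)
```

Note that `rs_distance_sq` does not depend on the threshold `b`, whereas `sign_disagreement` does, which is in line with the observation that re-computing the threshold can improve an RS classifier.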

The RS selection methods lead to worse results than RS construction; they are 
simpler and computationally faster, however. Of the two RS selection methods described 
(Tables 18.1 and 18.2), that using Kernel PCA is slightly superior at higher 
reductions. The $\ell_1$ penalization approach is computationally cheaper, however, 
since unlike the Kernel PCA based algorithm, it does not remove the SVs one at a 
time, and need not be iterated. 

We conclude this section by briefly describing another study, which combines 
the virtual SV method (Chapter 11) with Burges’s RS algorithm [87]. Virtual SV 
systems have yielded the most accurate results on the widely used MNIST handwritten 
character benchmark (Section A.1). They come with a computational cost, 
however: the resulting classifiers tend to have more SVs than an ordinary SVM. 
They are thus an ideal target for RS methods. Combining both procedures, an increase 
in speed of order 50 was achieved over the Virtual SVM in the test phase, 
with only a small decrease in performance (the test error increased from 1.0% to 
1.1%, cf. [87]), leading to a system that is approximately as fast as convolutional 
neural nets [320]. 


Figure 18.10 Illustration of all the reduced set vectors constructed by the iterative approach 
of Section 18.5.1 for n = 20. The associated coefficients are listed above the images. 
Top: recognizer of digit 0, ..., bottom: recognizer of digit 9. Note that positive coefficients 
(roughly) correspond to positive examples in the classification problem. 


18.6 Sequential Evaluation of Reduced Set Expansions 


As described above, RS algorithms typically work by finding RS vectors sequen- 
tially. This implies that we can stop calculating additional RS vectors once the 
approximation is satisfactory. There is, however, another implication that was re- 
cently pointed out [437]: even if we choose to compute a comprehensive RS ex- 
pansion, we might not always want to evaluate it in full for a given problem. For 
instance, if after evaluating the first three RS vectors of a SV classifier we can al- 
ready see the classification outcome is very likely ‘class 1,’ then it is not necessary 
to evaluate the remaining RS vectors. Romdhani et al. [437] applied this idea, together 
with the reduced set construction approach of Section 18.5.1, to the problem 
of face detection. 


Figure 18.11 First 10 reduced set vectors. Note that all vectors can be interpreted as either 
faces (such as the first one) or anti-faces (the second one) [437]. 


Figure 18.12 From left to right: input image, followed by patches that remain after the 
evaluation of 1 (13.3% patches remaining), 10 (2.6%), 20 (0.01%), and 30 (0.002%) filters. Note 
that in these images, the pixels representing the full patch are displayed when it yields 
a positive classification at its center. This explains the apparent discrepancy between the 
above percentages and the visual impression [437]. 

In face detection, run time classification speed is a major requirement. This is 
due to the fact that we essentially need to look at all locations in a given test image 
in order to detect all faces that are present. The standard approach is to scan a 
binary classifier (trained on the task of distinguishing faces from non-faces) over 
the entire image, looking at one image patch at a time [446, 399]. To make things 
worse, we must usually consider the possibility that faces occur at widely different 
scales, which necessitates several such scans. 

In their experimental investigation of this problem, Romdhani et al. obtain an 
initial SVM with 1742 SVs. Using the method of Section 18.5.1, these were reduced 
to 60 RS vectors. The first ten reduced set vectors are shown in Figure 18.11. 

For each possible cut-off $n = 1, \ldots, 60$, a threshold was computed which ensured 
that the false negative⁵ rate of the classifier, when run with only the first $n$ RS vectors, 
was sufficiently small. At the $n$th step, an RS expansion with $n$ vectors was scanned 
over the image. Note that this only entailed computation of one additional kernel 
per image location, since the first $n - 1$ evaluations were cached. Furthermore, the 
$n$th scan only had to cover those areas of the image that were not yet discarded as 
clear non-faces. For the data set considered, most of the image parts could typically 
be discarded with a single RS vector (cf. Figure 18.12); on average, each image 
location was “looked at” by 2.8 RS vectors. 
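
The early-rejection scheme just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of [437]: the Gaussian kernel, the name `sequential_evaluate`, the use of a single coefficient vector for all cut-offs, and the rejection thresholds (taken here as given inputs) are all hypothetical. The running sum plays the role of the cache, so each step costs one new kernel evaluation.

```python
import math


def gauss_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sequential_evaluate(x, zs, betas, thresholds, kernel=gauss_kernel):
    """Evaluate an RS expansion term by term, rejecting as early as possible.

    thresholds[n - 1] is the rejection threshold for the partial sum over the
    first n RS vectors (assumed to have been chosen beforehand).
    Returns (label, n_used): label is -1 on rejection, +1 if x survives all steps.
    """
    partial = 0.0
    for n, (z, beta) in enumerate(zip(zs, betas), start=1):
        partial += beta * kernel(x, z)      # the only new kernel evaluation
        if partial < thresholds[n - 1]:
            return -1, n                    # discarded after n RS vectors
    return 1, len(zs)
```

The second return value is the number of kernel evaluations spent on the input; averaging it over all image locations would give a quantity comparable to the average reported above.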

This method is not limited to face detection. It could be used in any task for 
which kernel expansions need to be evaluated quickly. The method demonstrates 
that RS algorithms can assist in incorporating a speed-accuracy trade-off in a 
rather natural way: the longer we are prepared to wait, the more accurate the 
result we get (be it classification or function value estimation). 


5. By false negatives, we refer to faces that are erroneously classified as non-faces. 


18.7 Summary 


Algorithms utilizing positive definite kernels construct their solutions as expansions 
$\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i)$ in terms of mapped input patterns. The map $\Phi$ is often 
unknown, however, or is too complex to provide any intuition about the solution 
$\Psi$. This has motivated efforts to reduce the complexity of the expansion, as summarized 
in this chapter. 

As an extreme case, we first described how to approximate $\Psi$ by a single $\Phi(z)$ (in 
other words, how to find an approximate pre-image of $\Psi$) and gave a fixed-point 
iteration algorithm to perform this task. The procedure was successfully applied 
to the problem of statistical denoising via Kernel PCA reconstruction. 

In situations where no good approximate pre-image exists, we can still reduce 
the complexity of $\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i)$ by expressing it as a sparser expansion, 
$\sum_{i=1}^{N_z} \beta_i \Phi(z_i)$ ($N_z < N_x$). We described methods for computing the optimal coefficients 
$\beta_i$, and for obtaining suitable patterns $z_i$, either by selecting among the $x_i$ 
or by iterating the above pre-image algorithm to construct synthetic patterns $z_i$. 
This led to rather useful algorithms for speeding up the evaluation of kernel expansions, 
such as SV decision functions. We reviewed applications of these algorithms 
in OCR and face detection. Comparing the RS construction and RS selection 
methods, we observe that the greatest speed gains are usually achieved using 
construction methods. These are computationally more expensive, however, and 
require the input data to lie in $\mathbb{R}^N$. 
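
For fixed patterns $z_j$, finding the coefficients $\beta_j$ is a linear least-squares problem: setting the gradient of $\|\sum_i \alpha_i \Phi(x_i) - \sum_j \beta_j \Phi(z_j)\|^2$ with respect to $\beta$ to zero gives $K^z \beta = K^{zx} \alpha$, where $K^z_{ij} = k(z_i, z_j)$ and $K^{zx}_{ij} = k(z_i, x_j)$. The sketch below solves this system directly. It is an illustration rather than a reference implementation: the Gaussian kernel, the function names, and the assumption that $K^z$ is invertible are choices made here.

```python
import math


def gauss_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def optimal_coefficients(alpha, xs, zs, k=gauss_kernel):
    """Coefficients beta minimizing ||sum_i alpha_i Phi(x_i) - sum_j beta_j Phi(z_j)||^2."""
    Kz = [[k(zi, zj) for zj in zs] for zi in zs]
    Kzx_alpha = [sum(a * k(zi, xj) for a, xj in zip(alpha, xs)) for zi in zs]
    return solve_linear(Kz, Kzx_alpha)
```

As a sanity check, choosing the $z_j$ equal to the $x_i$ returns the original coefficients, and for a single pattern $z$ the result reduces to $\langle \Psi, \Phi(z) \rangle / \langle \Phi(z), \Phi(z) \rangle$.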

In the case of face detection, we described a sequential approach that requires 
on average less than 3 kernel evaluations per image patch, making it competitive 
with the fastest available systems. 

Both the pre-image and the reduced set procedures are thus not only of theoret- 
ical interest for feature space methods, but also lead to practical applications. The 
proposed methods are applicable to a variety of feature space algorithms based on 
positive definite kernels. For instance, we could also speed up SV regression ma- 
chines and Kernel PCA feature extractors. We expect further possibilities to open 
up in future, as kernel methods are applied in an increasing range of learning and 
signal processing problems. 


18.8 Problems 


18.1 (Exact Pre-Images for RBF Kernels [467] ••) Generalize Proposition 18.1 to kernels 
which are functions of $\|x - x'\|$. 


18.2 (Exact Pre-Images for Gaussian Kernels ••) Use the reasoning following (18.8) 
to argue that a greedy approach, such as the iterated pre-image method from Section 18.5.1, 
does not necessarily find the optimal reduced set solution. Argue that this supports the use 
of a “phase II” approach (Section 18.5.2; cf. Section 18.2). 


18.3 (Justification of Approximate Pre-Images •) Use the Cauchy-Schwarz inequality 
to show that if $\|\Psi - \Phi(z)\|^2$ (cf. (18.9)) is small, then for any $v \in \mathcal{H}$, the dot product 
$\langle \Psi, v \rangle$ can be approximated by $\langle \Phi(z), v \rangle$. Argue that this is all that is needed for a kernel 
algorithm. 

Specialize to the case where $v$ has a pre-image $x$. 


18.4 (Approximate Single Pre-Images •) Consider the normal vector of a SV hyperplane 
in feature space. Using your favorite kernel, argue that it is usually impossible to 
find a single pre-image for the normal, as otherwise, the resulting class of decision 
functions would only have one term in the kernel expansion, which is not adequate for most 
complex problems. 


18.5 (Reconstruction from Principal Components ••) Devise an alternative to the 
suggested method for computing approximate pre-images from examples expressed in 
terms of their first principal components in feature space. For instance, use a suitable 
multi-output regression method for estimating the reconstruction mapping from the first 
q (q < m) kernel-based principal components to the inputs. Use the method for denoising. 


18.6 (Optimal Coefficient in the 1-Pattern Case •) Prove that the maximum of (18.15) 
can be extended to the minimum of (18.13) by setting 


$\beta = \langle \Psi, \Phi(z) \rangle / \langle \Phi(z), \Phi(z) \rangle.$ (18.51) 


Discuss the relationship to Proposition 18.2. 


18.7 (Data-Dependent RS Formulations ◦◦◦) Change the RS objective function 
$\|\Psi - \Psi'\|^2$ to $\sum_i \langle \Psi - \Psi', \Phi(x_i) \rangle^2$. Argue that this takes into account the distribution of the 
data points $x_i$ in a sensible way. Which of the algorithms in the present chapter can be 
generalized to use this objective function? Try to devise efficient algorithms to minimize it. 


18.8 (Speedup via Faster Evaluation of the Full Expansion ◦◦◦) In [421], a method 
was considered which speeds up the evaluation of an $m$-term kernel expansion to $\log m$ by 
using tree structures. 


(i) Apply this method, and compare the performance to the RS methods described in the 
present chapter. 

(ii) Can you combine this approach with those taken in the present chapter by designing 
RS algorithms which produce expansions that lend themselves well to the approach of 
[421]? 


18.9 (Minimum Support Solutions ◦◦◦) Can you use the minimum support methods 
of Mangasarian [346] to devise reduced set methods? 


18.10 (Reduced Sets via Clustering ◦◦◦) Discuss how you could speed up SVMs by 
clustering the input data, for instance using the k-means algorithm [152]. Would you use 
the cluster centers as training examples? If yes, how would you label them? Can you think 
of modifications to clustering algorithms that would make them particularly suited to a 
reduced set approach? 


18.11 ($\ell_1$ Penalization for Multi-Class RS Expansions [474] ••) Generalize the approach 
of Section 18.4.2 to the case where you are trying to simultaneously approximate 
several feature space expansions with few RS vectors (overall). 


18.12 (RS Construction for Degree 2 Monomial Kernels [84] •••) Consider the kernel 
$k(x, x') = \langle x, x' \rangle^2$ on $\mathbb{R}^N \times \mathbb{R}^N$. Denote by $S$ the symmetric matrix with elements 


$S_{jn} := \sum_{i=1}^{N_x} \alpha_i [x_i]_j [x_i]_n.$ (18.52) 


Prove that $\rho(\beta, z) := \left\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \beta \Phi(z) \right\|^2$ is minimized for $(\beta, z)$ satisfying 

$S z = \beta \langle z, z \rangle z.$ (18.53) 

Prove, moreover, that with this choice of $(\beta, z)$, 


$\rho(\beta, z) = \sum_{j,n} S_{jn}^2 - \beta^2 \langle z, z \rangle^2.$ (18.54) 


Therefore, one should choose the first pre-image $z$ to be the eigenvector of $S$ for which the 
eigenvalue $\lambda$ is largest in absolute value, scaled such that $\langle z, z \rangle = |\lambda|$. In this case (cf. 
(18.53)), $|\beta| = 1$. 

Generalize this argument to $N_z > 1$, showing that the RS vectors that follow should be 
the next eigenvectors (in terms of the absolute size of the eigenvalues) of $S$. Argue that this 
shows that using at most $N$ terms, the RS expansion can be made exact. 
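
The claim of Problem 18.12 can be checked numerically before attempting the proof. The sketch below is only a sanity check for the case $N = 2$, not a solution: it builds $S$ as in (18.52), takes its eigenvectors scaled to $\langle z, z \rangle = |\lambda|$ with $\beta = \lambda / |\lambda|$, and lets the reader compare the original and the reduced kernel expansions at arbitrary test points. The function names and the closed-form 2 × 2 eigen-decomposition are choices made here for illustration.

```python
import math


def sym_eig_2x2(a, b, c):
    """Eigenpairs (lambda, unit eigenvector) of the symmetric matrix [[a, b], [b, c]]."""
    if abs(b) < 1e-12:                        # matrix is already diagonal
        return [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    mean, rad = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    pairs = []
    for lam in (mean + rad, mean - rad):
        v = (b, lam - a)                      # solves (a - lam) v0 + b v1 = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


def rs_vectors_poly2(alpha, xs):
    """RS expansion for k(x, x') = <x, x'>^2 on R^2 from the eigenvectors of S."""
    a = sum(al * x[0] * x[0] for al, x in zip(alpha, xs))   # S_11
    b = sum(al * x[0] * x[1] for al, x in zip(alpha, xs))   # S_12 = S_21
    c = sum(al * x[1] * x[1] for al, x in zip(alpha, xs))   # S_22
    betas, zs = [], []
    for lam, u in sym_eig_2x2(a, b, c):
        if abs(lam) < 1e-12:
            continue                          # zero eigenvalue: no term needed
        scale = math.sqrt(abs(lam))           # gives <z, z> = |lambda|
        zs.append((scale * u[0], scale * u[1]))
        betas.append(1.0 if lam > 0 else -1.0)  # beta = lambda / |lambda|
    return betas, zs


def poly2_expansion(coeffs, points, x):
    """Evaluate sum_i coeffs[i] * <points[i], x>^2."""
    return sum(c * (p[0] * x[0] + p[1] * x[1]) ** 2 for c, p in zip(coeffs, points))
```

With at most two RS vectors, `poly2_expansion(betas, zs, x)` should agree with `poly2_expansion(alpha, xs, x)` up to rounding for any `x`, however many terms the original expansion has.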


18.13 (Direct Approximation of SVC Decision Functions ◦◦◦) The RS algorithms 
described approximate a SVC decision function by approximating the weight vector normal 
to the hyperplane and then recomputing the threshold. Can you come up with an RS 
approach that addresses the problem in a rather more direct way, using a cost function 
which looks at the approximation of the resulting decision boundary (cf. [55])? 


A.1 Data Sets 


USPS Set 


MNIST Set 


In the present section, we describe three of the datasets used. They are available 
from http://www.kernel-machines.org/data.html. 

The US Postal Service (USPS) database (see Figure A.1) contains 9298 handwritten 
digits (7291 for training, 2007 for testing), collected from mail envelopes in 
Buffalo [318]. Each digit is a 16 × 16 image, represented as a 256-dimensional vector 
with entries in the range −1 to 1. Preprocessing consisted of smoothing with a 
Gaussian kernel of width σ = 0.75. 

It is known that the USPS test set is rather difficult — the human error rate is 
2.5% [79]. For a discussion, see [496]. Note, moreover, that some of the results re- 
ported in the literature for the USPS set were obtained with an enhanced training 
set. For instance, [148] used an enlarged training set of size 9709 containing some 
additional machine-printed digits. The authors note that this improves the accu- 
racy on the test set. Similarly, [65] used a training set of size 9840. Since there are 
no machine-printed digits in the commonly used test set (size 2007), this addition 
distorts the original learning problem to a situation where results become some- 
what hard to interpret. In our experiments, we only used the original 7291 training 
examples. Results on the USPS problem can be found in Table 7.4. 

The MNIST database (Figure A.2) contains 120000 handwritten digits, divided 
equally into training and test sets. The database is a modified version of NIST 
Special Database 3 and NIST Test Data 1. Both training and test set consist of patterns 
generated by different writers. The images are first size normalized to fit into a 
20 x 20 pixel box, and then centered in a 28 x 28 image [319]. 

Figure A.1 The first 100 USPS training images, with class labels (from [467]). 

Figure A.2 The first 100 MNIST training images, with class labels (from [467]). 

Figure A.3 The first 100 small-MNIST training images, with class labels (from [467]). 


Most of the test results on the MNIST database given in the literature [e.g., 
320, 319] do not use the full MNIST test set of 60000 characters. Instead, a 
subset of 10000 characters is used, consisting of the test set patterns from 24476 to 
34475. To obtain results that can be compared to the literature, we also use this test 
set, although the larger one would be preferable from the point of view of obtaining 
more accurate estimates of the true risk. The error rates on the 10000 element 
test set are estimated to be reliable within about 0.1% [64]. The MNIST benchmark 
dataset is available from http://www.research.att.com/~yann/ocr/mnist/. 
MNIST results are listed in Table 11.2. 
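
Selecting the 10000-pattern subset amounts to a single slice. The sketch below assumes that the pattern numbers 24476 to 34475 are 1-based and inclusive, which the text does not state explicitly; the function name `literature_test_subset` is hypothetical.

```python
def literature_test_subset(test_patterns, first=24476, last=34475):
    """Return patterns number `first` to `last` of the full test set.

    Assumes 1-based, inclusive pattern numbers (an assumption made here),
    so pattern number n is stored at list index n - 1.
    """
    return test_patterns[first - 1:last]
```

Under this assumption the slice contains exactly 34475 − 24476 + 1 = 10000 patterns.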

The USPS database has been criticized (Burges, LeCun, private communication; 
[64]) as not providing the most adequate classifier benchmark. First, it only comes 
with a small test set; and second, the test set contains a number of corrupted pat- 
terns, which not even humans can classify correctly. The MNIST database, which 
is the classifier benchmark currently used in the AT&T and Bell Labs learning re- 
search groups, does not have these drawbacks; moreover, its training set is much 
larger. In some cases, however, it is useful to be able to study an algorithm on a 
smaller database. First, this can save computation time, and second, this allows 
the study of learning from smaller sample sizes. We thus generated a smaller ver- 
sion of the MNIST database which we used in some experiments. It is smaller in 
two respects: First, the patterns have a resolution of 20 x 20, obtained from the 
28 x 28 patterns by downsampling (combined with a Gaussian smoothing of stan- 
dard deviation 0.75 pixels, to avoid aliasing effects). Second, it only comprises a 
subset of the training set, namely the first 5000 patterns. We define the same 10000 
test examples as above as our test set. We refer to this database as the small MNIST 
database (Figure A.3). 


Abalone 


The Abalone database from the UCI repository [56] contains 4177 patterns. It 
is an integer regression problem; the labels are the ages of the abalones. 
For most practical purposes, it is treated as a generic regression problem in the 
examples of this book. The data is rescaled to zero mean and unit variance 
coordinate-wise, and the gender encoding (male/female/infant) is mapped into 
{(1, 0, 0), (0, 1, 0), (0, 0, 1)}. 


A.2 Proofs 


Proposition 7.5 


Proof Ad (i): By the KKT conditions, $\rho > 0$ implies $\delta = 0$, hence (7.52) becomes 
an equality (cf. (7.46)). Thus, at most a fraction $\nu$ of all examples can have $\alpha_i = 1/m$. 
All examples with $\xi_i > 0$ satisfy $\alpha_i = 1/m$ (if not, $\alpha_i$ could grow further to reduce 
$\xi_i$). 

Ad (ii): SVs can contribute at most $1/m$ to (7.52), hence there must be at least $\nu m$ 
of them. 


Ad (iii): This part of the proof is somewhat technical. Readers who prefer to skip 
it may instead consider the following sloppy argument: The difference between (i) 
and (ii) lies only in the points that sit exactly on the edge of the margin, since these 
are SVs with zero slack variables. As the training set size tends to infinity, however, 
only a negligible fraction of points can sit exactly on the margin, provided the 
distribution is well-behaved. 

For the formal proof, note that it follows from the condition on $P(x, y)$ that apart 
from some set of measure zero (arising from possible singular components), the 
two class distributions are absolutely continuous and can be written as integrals 
over distribution functions. As the kernel is analytic and non-constant, it cannot 
be constant in any open set, otherwise it would be constant everywhere. Therefore, 
the class of functions $f$ constituting the argument of the sgn in the SV decision 
function ((7.53); essentially, functions in the class of SV regression functions) 
transforms the distribution over $x$ into distributions such that for all $f$ 
and all $t \in \mathbb{R}$, $\lim_{\gamma \to 0} P(|f(x) + t| < \gamma) = 0$. At the same time, we know that the 
class of these functions has well-behaved covering numbers, hence we get uniform 
convergence: for all $\gamma > 0$, $\sup_f |P(|f(x) + t| < \gamma) - \hat{P}_m(|f(x) + t| < \gamma)|$ 
converges to zero in probability, where $\hat{P}_m$ is the sample-based estimate of $P$ (that 
is, the proportion of points that satisfy $|f(x) + t| < \gamma$). But then for all $\alpha > 0$, 
$\lim_{\gamma \to 0} \lim_{m \to \infty} P(\sup_f \hat{P}_m(|f(x) + t| < \gamma) > \alpha) = 0$. Hence, $\sup_f \hat{P}_m(|f(x) + t| = 0)$ 
converges to zero in probability. Using $t = \pm\rho$ thus shows that the fraction of 
points exactly on the margin almost surely tends to zero, hence the fraction of SVs 
equals that of margin errors. Combining (i) and (ii) then shows that both fractions 
converge almost surely to $\nu$. ∎ 


Additionally, since (7.51) means that the sums over the coefficients of positive 
and negative SVs respectively are equal, we conclude that Proposition 7.5 actually 
holds for both classes separately, with $\nu/2$. As an aside, note that by the same 
argument, the numbers of SVs on each side of the margin asymptotically agree. 


Proposition 7.7 


Proof Since the slack variable of $x_m$ satisfies $\xi_m > 0$, the KKT conditions (Chapter 6) 
imply $\alpha_m = 1/m$. If $\delta$ is sufficiently small, then transforming the point into 
$x'_m := x_m + \delta w$ results in a slack that is still nonzero, $\xi'_m > 0$, hence we have 
$\alpha'_m = 1/m = \alpha_m$. Updating the $\xi_m$, and keeping all other primal variables unchanged, 
we obtain a modified set of primal variables that is still feasible. 

We next show how to obtain a corresponding set of feasible dual variables. To 
keep $w$ unchanged, we need to satisfy 

$\sum_{i=1}^{m} \alpha_i y_i x_i = \sum_{i \neq m} \alpha'_i y_i x_i + \alpha_m y_m x'_m.$ (A.1) 

Substituting $x'_m = x_m + \delta w$ and (7.57), we note that a sufficient condition for this to 
hold is that for all $i \neq m$, $\alpha'_i = \alpha_i - \delta \gamma_i y_i \alpha_m y_m$. 

Since by assumption $\gamma_i$ is only nonzero if $\alpha_i \in (0, 1/m)$, $\alpha'_i$ will be in $(0, 1/m)$ 
if $\alpha_i \in (0, 1/m)$, provided $\delta$ is sufficiently small, and it will equal $1/m$ if $\alpha_i = 1/m$. 
In both cases, we end up with a feasible solution $\alpha'$, and the KKT conditions are 
still satisfied. Thus (cf. Chapter 6), $(w, b)$ are still the hyperplane parameters of the 
solution. ∎ 


Proposition 9.2 


Proof Ad (i): The constraints (9.43) and (9.44) imply that at most a fraction $\nu$ of 
all examples can have $\alpha_i^{(*)} = C/m$. All examples with $\xi_i^{(*)} > 0$ (in other words, 
those outside the tube) certainly satisfy $\alpha_i^{(*)} = C/m$ (if not, $\alpha_i^{(*)}$ could grow further 
to reduce $\xi_i^{(*)}$). 

Ad (ii): By the KKT conditions, $\varepsilon > 0$ implies $\beta = 0$. Hence, (9.44) becomes an 
equality (cf. (9.37)).¹ Since SVs are those examples for which $0 < \alpha_i^{(*)} \le C/m$, the 
result follows (using $\alpha_i \alpha_i^* = 0$ for all $i$, (9.58)). 

Ad (iii): The strategy of proof is to show that asymptotically, the probability of a 
point lying on the edge of the tube vanishes. The condition on $P(y|x)$ means that 


$\sup_{f,t} \mathbf{E}_x \left[ P\left( |f(x) + t - y| < \gamma \mid x \right) \right] < \delta(\gamma)$ (A.2) 


for some function $\delta(\gamma)$ that approaches zero as $\gamma \to 0$. Since the class of SV 
regression estimates $f$ has well-behaved covering numbers, we have [14, Chapter 
21] that for all $t$, 


$P\left( \sup_f \left( \hat{P}_m(|f(x) + t - y| < \gamma/2) - P(|f(x) + t - y| < \gamma) \right) > \alpha \right) < c_1 c_2^{-m},$ (A.3) 


where $\hat{P}_m$ is the sample-based estimate of $P$ (that is, the proportion of points that 
satisfy $|f(x) - y + t| < \gamma$), and $c_1, c_2$ may depend on $\gamma$ and $\alpha$. Discretizing the 
values of $t$, taking the union bound (Chapter 5), and applying (A.2) shows that the 
supremum over $f$ and $t$ of $\hat{P}_m(f(x) - y + t = 0)$ converges to zero in probability. 
Thus, the fraction of points on the edge of the tube almost surely converges to 0. 
Consequently the fraction of SVs equals that of errors. Combining (i) and (ii) then 
shows that both fractions converge almost surely to $\nu$. ∎ 


1. In practice, we can alternatively work with (9.44) as an equality constraint, provided that 
$\nu$ is chosen small enough ($\nu < 1$) to ensure that it does not pay to make $\varepsilon$ negative. 


Proposition 17.9 


Proof We proceed in a manner similar to the proof of Proposition 17.8, but use 
$\mathcal{N}(\varepsilon, \mathcal{F}_c, d)$ and $\eta/2$ to bound $R[f^*_{\mathrm{emp}}]$: 


$R[f^*_{\mathrm{emp}}] - R[f^*] = R[f^*_{\mathrm{emp}}] - R_{\mathrm{emp}}[f^*_{\mathrm{emp}}] + R_{\mathrm{emp}}[f^*_{\mathrm{emp}}] - R[f^*]$ (A.4) 
$\le \varepsilon + R[f_i] - R_{\mathrm{emp}}[f_i] + R_{\mathrm{emp}}[f^*_{\mathrm{emp}}] - R[f^*]$ (A.5) 
$\le \varepsilon + 2 \max_{f \in V_\varepsilon \cup \{f^*\}} \left| R[f] - R_{\mathrm{emp}}[f] \right|$ (A.6) 


where $V_\varepsilon$ is the $\varepsilon$-cover of $\mathcal{F}_c$ of size $\mathcal{N}(\varepsilon, \mathcal{F}_c, L_\infty(\ell_2^d))$, $f_i \in V_\varepsilon$, and clearly 
$R_{\mathrm{emp}}[f^*_{\mathrm{emp}}] \le R_{\mathrm{emp}}[f^*]$. The application of Hoeffding’s inequality and the union bound, and the 
change of $\eta + \varepsilon$ to $\eta$, then prove the claim. ∎ 


Proposition 17.10 


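Proof The proof uses a clever trick from [292], however without the difficulty of 
also having to bound the approximation error. Since by hypothesis $\mathcal{F}_c$ is compact, 
we can use Proposition 17.9. We have 


$R[f^*_{\mathrm{emp}}] - R[f^*] = \int_0^\infty P\{R[f^*_{\mathrm{emp}}] - R[f^*] > \eta\} \, d\eta$ 
$\le u + \varepsilon + 2\left(\mathcal{N}(\varepsilon, \mathcal{F}_c, d) + 1\right) \int_{u+\varepsilon}^\infty \exp\left(-\frac{m(\eta - \varepsilon)^2}{2 e_c}\right) d\eta$ 
$\le u + \varepsilon + \frac{2 e_c}{u m} \left(\mathcal{N}(\varepsilon, \mathcal{F}_c, d) + 1\right) \exp\left(-\frac{u^2 m}{2 e_c}\right)$ 
$\le \sqrt{\frac{2 e_c \ln(\mathcal{N}(\varepsilon, \mathcal{F}_c, d) + 1)}{m}} + \varepsilon + \sqrt{\frac{2 e_c}{m \ln(\mathcal{N}(\varepsilon, \mathcal{F}_c, d) + 1)}}.$ (A.7) 


Here we use $\int_x^\infty \exp(-t^2/2) \, dt \le \exp(-x^2/2)/x$ in the second step. The third inequality 
is derived by substituting 


$u = \sqrt{\frac{2 e_c}{m} \ln(\mathcal{N}(\varepsilon, \mathcal{F}_c, d) + 1)}.$ (A.8) 

For part 1, we set $\varepsilon = m^{-1/2}$ and obtain 

$R[f^*_{\mathrm{emp}}] - R[f^*] = O\left(m^{-1/2} \ln^{\alpha/2} m\right).$ (A.9) 

For part 2, (A.7) implies (for some constants $c, c' > 0$) 

$R[f^*_{\mathrm{emp}}] - R[f^*] \le c \, \varepsilon^{-\alpha/2} m^{-1/2} + \varepsilon + c' \varepsilon^{\alpha/2} m^{-1/2}.$ (A.10) 


The minimum is obtained for $\varepsilon = c'' m^{-1/(\alpha+2)}$ for some $c'' > 0$. Hence the overall 
term is of order $O(m^{-1/(\alpha+2)})$, as required. ∎ 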


Mathematical Prerequisites 


The beginner... should not be discouraged if... he finds that he does not have the prerequisites 
for reading the prerequisites. 
P. Halmos¹ 


In this chapter, we introduce mathematical results that might not be known to 
all readers, but which are sufficiently standard that they need not be put into the actual 
chapters. 

This exposition is almost certainly incomplete, and some readers will inevitably 
happen upon terms in the book that are unknown to them, yet not explained here. 
Consequently, we also give some further references. 


B.1 Probability 


Domain 


Event 
Probability 


B.1.1 Probability Spaces 


Let us start with some basic notions of probability theory. For further detail, we 
refer to [77, 165, 561]. We do not try to be rigorous; instead, we endeavor to give 
some intuition and explain how these concepts are related to our present interests. 

Assume we are given a nonempty set X, called the domain or universe. We refer to 
the elements x of X as patterns. The patterns are generated by a stochastic source. 
For instance, they could be handwritten digits, which are subject to fluctuations in 
their generation best modelled probabilistically. In the terms of probability theory, 
each pattern x is considered the outcome of a random experiment. 

We would next like to assign probabilities to the patterns. We naively think of 
a probability as being the limiting frequency of a pattern; in other words, how 
often, relative to the number of trials, a certain pattern x comes up in a random 
experiment, if we repeat this experiment infinitely often. 

It turns out to be convenient to be slightly more general, and to talk about the 
probability of sets of possible outcomes; that is, subsets C of X called events. We 
denote the probability that the outcome of the experiment lies in C by 


$P\{x \in C\}.$ (B.1) 


1. Quoted after [429]. 


If $\Upsilon$ is a logical formula in terms of $x$, meaning a mapping from $\mathcal{X}$ to {true, false}, 
then it is sometimes convenient to talk about the probability of $\Upsilon$ being true. We 
will use the same symbol $P$ in this case, and define its usage as 


$P\{\Upsilon(x)\} := P\{x \in C\}$ where $C = \{x \in \mathcal{X} \mid \Upsilon(x) = \text{true}\}$. (B.2) 

Let us also introduce the shorthand 

$P(C) := P\{x \in C\},$ (B.3) 

to be read as “the probability of the event C.” If P satisfies some fairly natural 
conditions, it is called a probability measure. It is also referred to as the (probability) 
distribution of x. 

In the case where $\mathcal{X} \subset \mathbb{R}^N$, the patterns are usually referred to as random variables 
($N = 1$) or random vectors ($N > 1$). A generic term we shall sometimes use is random 
quantity.² 

To emphasize the fact that $P$ is the distribution of $x$, we sometimes denote it as 
$P_x$ or $P(x)$.³ To give the precise definition of a probability measure, we first need to 
be a bit more formal about which sets $C$ we are going to allow. Certainly, 


$C = \mathcal{X}$ (B.4) 


should be a possibility, corresponding to the event that necessarily occurs (“sure 
thing”). If $C$ is allowed, then its complement, 


$\bar{C} := \mathcal{X} \setminus C,$ (B.5) 


should also be allowed. This corresponds to the event “not $C$.” Finally, if $C_1, C_2, \ldots$ 
are events, then we would like to be able to talk about the probability of the event 
“$C_1$ or $C_2$ or ...”, hence 


$\bigcup_{i=1}^{\infty} C_i$ (B.6) 

should be an allowed event. 


Definition B.1 ($\sigma$-Algebra) A collection $\mathcal{C}$ of subsets of $\mathcal{X}$ is called a $\sigma$-algebra on $\mathcal{X}$ if 


(i) $\mathcal{X} \in \mathcal{C}$; in other words, (B.4) is one of its elements; 
(ii) it is closed under complementation, meaning if $C \in \mathcal{C}$, then also (B.5); and 
(iii) it is closed under countable⁴ unions: if $C_1, C_2, \ldots \in \mathcal{C}$, then also (B.6). 


2. For simplicity, we are somewhat sloppy in not distinguishing between a random variable 
and the values it takes. Likewise, we deviate from standard usage in not having introduced 
random variables as functions on underlying universes of events. 

3. The latter is somewhat sloppy, as it suggests that P takes elements of X as inputs, which 
it does not: P is defined for subsets of X. 

4. Countable means with a number of elements not larger than that of $\mathbb{N}$. Formally, a set 


The elements of a $\sigma$-algebra are sometimes referred to as measurable sets. 


We are now in a position to formalize our intuitions about the probability measure. 


Definition B.2 (Probability Measure) Let $\mathcal{C}$ be a $\sigma$-algebra on the domain $\mathcal{X}$. A function 


$P : \mathcal{C} \to [0, 1]$ (B.7) 

is called a probability measure if it is normalized, 

$P(\mathcal{X}) = 1,$ (B.8) 


and $\sigma$-additive, meaning that for sets $C_1, C_2, \ldots \in \mathcal{C}$ that are mutually disjoint 
($C_i \cap C_j = \emptyset$ if $i \neq j$), we have 


$P\left(\bigcup_{i=1}^{\infty} C_i\right) = \sum_{i=1}^{\infty} P(C_i).$ (B.9) 


As an aside, note that if we drop the normalization condition, we are left with 
what is called a measure. 

Taken together, $(\mathcal{X}, \mathcal{C}, P)$ is called a probability space. This is the mathematical 
description of the probabilistic experiment. 
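
On a finite domain, Definitions B.1 and B.2 can be checked mechanically, which may help in building intuition. The sketch below is our own illustration, with hypothetical function names. It relies on the fact that on a finite domain, countable unions reduce to finite ones, so that closure and additivity for pairs of sets suffice (by induction).

```python
from itertools import combinations


def is_sigma_algebra(domain, collection):
    """Check Definition B.1 for a collection of subsets of a finite domain."""
    domain = frozenset(domain)
    sets = {frozenset(c) for c in collection}
    if domain not in sets:                              # (i) contains the domain
        return False
    if any(domain - c not in sets for c in sets):       # (ii) complements
        return False
    return all(a | b in sets for a, b in combinations(sets, 2))  # (iii) unions


def is_probability_measure(domain, collection, P, tol=1e-12):
    """Check Definition B.2 for a set function P on a finite sigma-algebra."""
    if not is_sigma_algebra(domain, collection):
        return False
    domain = frozenset(domain)
    sets = {frozenset(c) for c in collection}
    if abs(P(domain) - 1.0) > tol:                      # normalization (B.8)
        return False
    if any(not (-tol <= P(c) <= 1.0 + tol) for c in sets):
        return False
    # additivity (B.9), checked for all pairs of disjoint sets
    return all(abs(P(a | b) - P(a) - P(b)) <= tol
               for a, b in combinations(sets, 2) if not (a & b))
```

For instance, for a fair die with the events "odd" and "even", the collection consisting of the empty set, {1, 3, 5}, {2, 4, 6}, and the whole domain passes both checks with `P = lambda c: len(c) / 6.0`.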


B.1.2 IID Samples 


Nevertheless, we are not quite there yet, since most of the probabilistic statements 
in this book do not talk about the outcomes of the experiment described by 
(X, C€, P). For instance, when we are trying to learn something about a regularity 
(that is, about some aspects of P) based on a collection of patterns x1,...,Xm E€ X 
(usually called a sample), we actually perform the random experiment m times, 
under identical conditions. This is referred to as drawing an tid (independent and 
identically distributed) sample from P. 

Formally, drawing an iid sample can be described by the probability space 
$(X^m, \mathcal{C}^m, P^m)$. Here, $X^m$ denotes the m-fold Cartesian product of X with itself (thus, 
each element of $X^m$ is an m-tuple of elements of X), and $\mathcal{C}^m$ denotes the smallest 
σ-algebra that contains the elements of the m-fold Cartesian product of $\mathcal{C}$ with itself. 
Likewise, the product measure $P^m$ is determined uniquely by 

$$P^m((C_1, \ldots, C_m)) = \prod_{i=1}^{m} P(C_i). \tag{B.10}$$

Note that the independence of the “iid” is encoded in (B.10) being a product of 
measures on $\mathcal{C}$, while the identicality lies in the fact that all the measures on $\mathcal{C}$ are 
one and the same. 
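
The product formula (B.10) can be checked by brute force on a small example. The sketch below assumes a fair die as the single-draw experiment; `P_m` and `P_m_by_counting` are hypothetical helper names introduced here.

```python
from fractions import Fraction
from itertools import product

X = range(1, 7)  # faces of a fair die (assumed toy domain)

def P(C):
    """Single-draw measure: uniform on X."""
    return Fraction(len(set(C) & set(X)), len(X))

def P_m(sets):
    """Product measure of the 'rectangle' C_1 x ... x C_m, as in (B.10)."""
    result = Fraction(1)
    for C in sets:
        result *= P(C)
    return result

def P_m_by_counting(sets):
    """Same quantity by brute force: count m-tuples that fall in the rectangle."""
    m = len(sets)
    hits = sum(all(x in C for x, C in zip(tup, sets))
               for tup in product(X, repeat=m))
    return Fraction(hits, len(X) ** m)

evens, small = {2, 4, 6}, {1, 2}
rect = (evens, small, evens)   # three draws: even, at most 2, even
```

Both functions agree on `rect`, which is what independence of the three draws means for rectangles of this kind.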


4 (continued). A set is countable if there is a surjective map from N onto this set; 
that is, a map with range encompassing the whole set. 
By analogy to (B.2), we sometimes talk about the probability of a logical formula 
involving an m-sample,⁵ 

$$P\{\xi(x_1, \ldots, x_m)\} = P^m\{(x_1, \ldots, x_m) \in X^m \mid \xi(x_1, \ldots, x_m) = \text{true}\}. \tag{B.11}$$


So far, we have denoted the outcomes of the random experiments as x for 
simplicity, and have referred to them as patterns. In many cases studied in this 
book, however, we will not only observe patterns $x \in X$ but also targets $y \in Y$. 
For instance, in binary pattern recognition, we have $Y = \{\pm 1\}$. The underlying 
regularity is now assumed to generate examples (x, y). All of the above applies to 
this case, with the difference that we now end up with a probability measure on 
$X \times Y$, called the (joint) distribution of (x, y). 


B.1.3 Densities and Integrals 


We now move on to the concept of a density, often confused with the distribution. 
For simplicity, we restrict ourselves to the case where $X = \mathbb{R}^N$; in this instance, $\mathcal{C}$ is 
usually taken to be the Borel σ-algebra.⁶ 


Definition B.3 (Density) We say that the nonnegative function p is the density of the 
distribution P if for all $C \in \mathcal{C}$, 

$$P(C) = \int_C p(x)\,dx. \tag{B.12}$$

If such a p exists, it is uniquely determined.⁷ 


Not all distributions actually have a density. To see this, let us consider a distri- 
bution that does. If we plug a set of the form C = {x} into (B.12), we see that 
P({x}) = 0; that is, the distribution assigns zero probability to any set of the form 
{x}. We infer that only distributions that assign zero probability to individual 
points can have densities. 

It is important to understand the difference between distributions and densities. 
The distribution takes sets of patterns as inputs, and assigns them a probability 
between 0 and 1. The density takes an individual pattern as its input, and assigns 
a nonnegative number (possibly larger than 1) to it. Using (B.12), the density can be 
used to compute the probability of a set C. If the density is a continuous function, 
and we use a small neighborhood of point x as the set C, then P is approximately 


5. Note that there is some sloppiness in the notation: strictly speaking, we should denote 
this quantity as $P^m$; usually, however, it can be inferred from the context that we actually 
mean the m-fold product measure. 

6. Readers not familiar with this concept may simply think of it as a collection that contains 
all “reasonable” subsets of $\mathbb{R}^N$. 

7. Almost everywhere; in other words, up to a set N with P(N) = 0. 

8. In our case, we can show that the distribution P has a density if and only if it is absolutely 
continuous with respect to the Lebesgue measure on $\mathbb{R}^N$, meaning that every set of 
Lebesgue-measure zero also has P-measure zero. 
the size (i.e., the measure) of the neighborhood times the value of p; in this case, 
and in this sense, the two quantities are proportional. 
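
The relationship between a distribution and its density can be illustrated numerically. The following sketch assumes a standard normal distribution on the real line (not an example from the text) and compares the probability of a small neighborhood with the neighborhood size times the density value.

```python
import math

def density(x):
    """Standard normal density p(x) on R (an assumed example distribution)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def distribution_function(z):
    """F(z) = P{x <= z} for the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_interval(a, b):
    """P([a, b]) computed from the distribution, cf. (B.12)."""
    return distribution_function(b) - distribution_function(a)

x, h = 0.0, 1e-3
exact = prob_interval(x - h, x + h)     # probability of a small neighborhood
approx = 2 * h * density(x)             # neighborhood size times density value

# A density may exceed 1 even though probabilities never do:
narrow_peak = density(0.0) / 0.1        # density at 0 of a normal with std 0.1
```

Here `exact` and `approx` agree closely because the density is continuous and the neighborhood is small, while `narrow_peak` shows a density value larger than 1.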

A more fundamental concept, which exists for every distribution of a random 
quantity taking values in $\mathbb{R}^N$, is the distribution function,⁹ 

$$F : \mathbb{R}^N \to [0, 1] \tag{B.13}$$
$$z \mapsto F(z) = P\{[x]_1 \le [z]_1 \wedge \ldots \wedge [x]_N \le [z]_N\}. \tag{B.14}$$


Finally, we need to introduce the notion of an integral with respect to a measure. 
Consider a function $f : \mathbb{R}^N \to \mathbb{R}$. We denote by 

$$\int f(x)\,dP(x) \tag{B.15}$$

the integral of a function with respect to the distribution (or measure) P, provided 
that f is measurable. For our purposes, the latter means that for every interval 
$[a, b] \subset \mathbb{R}$, $f^{-1}([a, b])$ (the set of all points in $\mathbb{R}^N$ that get mapped to [a, b]) is an 
element of $\mathcal{C}$. Component-wise extension to vector-valued functions is 
straightforward. 

In the case where P has a density p, (B.15) equals 


$$\int f(x)\,p(x)\,dx, \tag{B.16}$$

which is a standard integral in $\mathbb{R}^N$, weighted by the density function p. 

If P does not have a density, we can define the integral by decomposing the 
range of f into disjoint half-open intervals $[a_i, b_i)$, and computing the measure 
of each set $f^{-1}([a_i, b_i))$ using P. The contribution of each such set to the integral 
is determined by multiplying this measure with the function value (on the set), 
which by construction is in $[a_i, b_i)$. The exact value of the integral is obtained by 
taking the limit at infinitely small intervals. This construction, which is the basic 
idea of the Lebesgue integral, does not rely on f being defined on $\mathbb{R}^N$; it works for 
general sets X as long as they are suitably endowed with a measure. 

Let us consider a special case. If P is the empirical measure with respect to 
$x_1, \ldots, x_m$,¹⁰ 

$$P^m_{\mathrm{emp}}(C) := \frac{|C \cap \{x_1, \ldots, x_m\}|}{m}, \tag{B.17}$$

which represents the fraction of points that lie in C, then the integral takes the form 

$$\int f(x)\,dP^m_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} f(x_i). \tag{B.18}$$

As an aside, note that this shows the empirical risk term (1.17) can actually be 
thought of as an integral, just like the actual risk (1.18). 
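
The empirical measure and the corresponding integral are short to write down in code. The sketch below is illustrative only; representing the set C by a membership predicate is a convention assumed here, and the sample values are made up.

```python
from fractions import Fraction

def empirical_measure(sample, C):
    """Fraction of sample points that lie in the set C, cf. (B.17).

    C is given as a membership predicate (a convention chosen for this sketch).
    """
    return Fraction(sum(1 for x in sample if C(x)), len(sample))

def empirical_integral(sample, f):
    """Integral of f with respect to the empirical measure, cf. (B.18)."""
    return sum(f(x) for x in sample) / len(sample)

sample = [0.5, 1.5, 2.0, 3.5]
in_interval = lambda x: 1.0 <= x <= 3.0

frac_in_interval = empirical_measure(sample, in_interval)
mean_square = empirical_integral(sample, lambda x: x * x)

# Integrating the indicator function of C recovers the measure of C.
indicator_integral = empirical_integral(sample, lambda x: 1.0 if in_interval(x) else 0.0)
```

Replacing `f` by a loss evaluated on each example gives the familiar average that defines an empirical risk.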


9. We use ∧ to denote the logical “and” operation, and $[z]_i$ to denote the i-th component of z. 
10. By |·| we denote the number of elements in a set. 
If P is a probability distribution (rather than a general measure), then two more 
special cases of interest are obtained for particular choices of functions f in (B.15). 
If f is the identity on $\mathbb{R}^N$, we get the expectation E[x]. If $f(x) = (x - E[x])^2$ (on 
$\mathbb{R}$), we obtain the variance of x, denoted by var(x). In the N-dimensional case, the 
functions $f_{ij}(x) = (x_i - E[x_i])(x_j - E[x_j])$ lead to the covariance $\mathrm{cov}(x_i, x_j)$. For a data 
set $\{x_1, \ldots, x_m\}$, the matrix $(\mathrm{cov}(x_i, x_j))_{ij}$ is called the covariance matrix. 
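
As an illustration, the covariance matrix of a small data set can be computed by integrating the functions above against the empirical measure. The 1/m normalization and the toy data are assumptions of this sketch.

```python
def mean_vector(data):
    """Component-wise empirical expectation of a list of equal-length vectors."""
    m = len(data)
    return [sum(x[i] for x in data) / m for i in range(len(data[0]))]

def covariance_matrix(data):
    """Empirical covariance matrix between the components of the data vectors.

    Uses the 1/m normalization that matches integration against the
    empirical measure (an assumption; 1/(m-1) is also common in practice).
    """
    m, n = len(data), len(data[0])
    mu = mean_vector(data)
    return [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in data) / m
             for j in range(n)] for i in range(n)]

# Three points in R^2 whose second component is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance_matrix(data)
```

The diagonal entries of `cov` are the variances of the individual components, and the matrix is symmetric by construction.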


B.1.4 Stochastic Processes 


A stochastic process y on a set X is a random quantity indexed by $x \in X$. This means 
that for every x, we get a random quantity y(x) taking values in $\mathbb{R}$, or more 
generally, in a set R. A stochastic process is characterized by the joint probability 
distributions of y on arbitrary finite subsets of X; in other words, of $(y(x_1), \ldots, y(x_m))$.¹¹ 

A Gaussian process is a stochastic process with the property that for any 
$\{x_1, \ldots, x_m\} \subset X$, the random quantities $(y(x_1), \ldots, y(x_m))$ have a joint Gaussian 
distribution with mean μ and covariance matrix K. The matrix elements $K_{ij}$ are 
given by a covariance kernel $k(x_i, x_j)$. 

When a Gaussian process is used for learning, the covariance function $k(x_i, x_j) := 
\mathrm{cov}(y(x_i), y(x_j))$ essentially plays the same role as the kernel in an SVM. See Section 
16.3 and [587, 596] for further information. 
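
A finite-dimensional marginal of a Gaussian process can be sampled with a Cholesky factorization of the covariance matrix. The following is a minimal sketch, assuming zero mean, a Gaussian covariance kernel, and a small diagonal jitter for numerical stability; none of these choices is prescribed by the text.

```python
import math
import random

def gaussian_kernel(x, xp, width=1.0):
    """k(x, x') = exp(-(x - x')^2 / (2 width^2)); the kernel choice is an assumption."""
    return math.exp(-((x - xp) ** 2) / (2.0 * width ** 2))

def cholesky(K):
    """Lower-triangular L with L L^T = K, for symmetric strictly positive definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def sample_gp(xs, rng, jitter=1e-8):
    """Draw (y(x_1), ..., y(x_m)) from a zero-mean GP with K_ij = k(x_i, x_j)."""
    m = len(xs)
    K = [[gaussian_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]   # independent standard normals
    y = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
    return y, K, L

xs = [0.0, 0.5, 1.0, 2.0]
y, K, L = sample_gp(xs, random.Random(0))
```

Since z has identity covariance, y = Lz has covariance L Lᵀ = K, which is the defining property of the finite-dimensional distribution.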


B.2 Linear Algebra 


B.2.1 Vector Spaces 


We move on to basic concepts of linear algebra, which is to say the study of 
vector spaces. Additional detail can be found in any textbook on linear algebra 
(e.g., [170]). The feature spaces studied in this book have a rich mathematical 
structure, which arises from the fact that they allow a number of useful operations 
to be carried out on their elements: addition, multiplication with scalars, and the 
product between the elements themselves, called the dot product. 

What’s so special about these operations? Let us, for a moment, go back to our 
earlier example (Chapter 1), where we classify sheep. Surely, nobody would come 
up with the idea of trying to add two sheep, let alone compute their dot product. 
The set of sheep does not form a vector space; mathematically speaking, it could 
be argued that it does not have a very rich structure. However, as discussed in 
Chapter 1 (cf. also Chapter 2), it is possible to embed the set of all sheep into a 
dot product space such that we can think of the dot product as a measure of 


11. Note that knowledge of the finite-dimensional distributions (fdds) does not yield com- 
plete information on the properties of the sample paths of the stochastic process; two dif- 
ferent processes which have the same fdds are known as versions of one another. 
the similarity of two sheep. In this space, we can perform the addition of two 
sheep, multiply sheep with numbers, compute hyperplanes spanned by sheep, 
and achieve many other things that mathematicians like. 


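Definition B.4 (Real Vector Space) A set H is called a vector space (or linear space) 
over $\mathbb{R}$ if addition and scalar multiplication are defined, and satisfy (for all 
$x, x', x'' \in H$, and $\lambda, \lambda' \in \mathbb{R}$) 

$$x + (x' + x'') = (x + x') + x'', \tag{B.19}$$
$$x + x' = x' + x \in H, \tag{B.20}$$
$$0 \in H, \quad x + 0 = x, \tag{B.21}$$
$$-x \in H, \quad -x + x = 0, \tag{B.22}$$
$$\lambda x \in H, \tag{B.23}$$
$$1x = x, \tag{B.24}$$
$$\lambda(\lambda' x) = (\lambda \lambda') x, \tag{B.25}$$
$$\lambda(x + x') = \lambda x + \lambda x', \tag{B.26}$$
$$(\lambda + \lambda') x = \lambda x + \lambda' x. \tag{B.27}$$

The first four conditions amount to saying that (H, +) is a commutative group.¹² 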


We have restricted ourselves to vector spaces over R. The definition in the complex 
case is analogous, both here and in most of what follows. Any non-empty subset 
of H that is itself a vector space is called a subspace of H. 

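Among the things we can do in a vector space are linear combinations, 

$$\sum_{i=1}^{m} \lambda_i x_i, \quad \text{where } \lambda_i \in \mathbb{R},\ x_i \in H, \tag{B.28}$$

and convex combinations, 

$$\sum_{i=1}^{m} \lambda_i x_i, \quad \text{where } \lambda_i \ge 0,\ \sum_i \lambda_i = 1,\ x_i \in H. \tag{B.29}$$

The set $\{\sum_{i=1}^{m} \lambda_i x_i \mid \lambda_i \in \mathbb{R}\}$ is referred to as the span of the vectors $x_1, \ldots, x_m$. 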

A set of vectors $x_i$, chosen such that none of the $x_i$ can be written as a linear 
combination of the others, is called linearly independent. A set of vectors $x_i$ that 
allows us to uniquely write each element of H as a linear combination is called a 
basis of H. For the uniqueness to hold, the vectors have to be linearly independent. 
All bases of a vector space H have the same number of elements, called the 
dimension of H. 

The standard example of a finite-dimensional vector space is $\mathbb{R}^N$, the space 
of column vectors $([x]_1, \ldots, [x]_N)^\top$, where the $\top$ denotes the transpose. In $\mathbb{R}^N$, 


12. Note that (B.21) and (B.22) should be read as existence statements. For instance, (B.21) 
states that there exists an element, denoted by 0, with the required property. 
addition and scalar multiplication are defined element-wise. The canonical basis 
of $\mathbb{R}^N$ is $\{e_1, \ldots, e_N\}$, where for $j = 1, \ldots, N$, $[e_j]_i = \delta_{ij}$. Here $\delta_{ij}$ is the Kronecker 
symbol; 

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \tag{B.30}$$


A somewhat more abstract example of a vector space is the space of all real-valued 
functions on a domain X, denoted by $\mathbb{R}^X$. Here, addition and scalar 
multiplication are defined by 

$$(f + g)(x) := f(x) + g(x), \tag{B.31}$$
$$(\lambda f)(x) := \lambda f(x). \tag{B.32}$$


We shall return to this example below. 

Linear algebra is the study of vector spaces and linear maps (sometimes called 
operators) between vector spaces. Given two real vector spaces $H_1$ and $H_2$, the 
latter are maps 

$$L : H_1 \to H_2 \tag{B.33}$$

that for all $x, x' \in H_1$, $\lambda, \lambda' \in \mathbb{R}$ satisfy 

$$L(\lambda x + \lambda' x') = \lambda L(x) + \lambda' L(x'). \tag{B.34}$$


It is customary to omit the parentheses for linear maps; thus we normally write Lx 
rather than L(x). 

Let us go into more detail, using (for simplicity) the case where $H_1$ and $H_2$ are 
identical, have dimension N, and are written H. Due to (B.34), a linear map L is 
completely determined by the values it takes on a basis of H. This can be seen by 
writing an arbitrary input as a linear combination in terms of the basis vectors $e_j$, 
and then applying L; 

$$L \sum_{j=1}^{N} \lambda_j e_j = \sum_{j=1}^{N} \lambda_j L e_j. \tag{B.35}$$

The image of each basis vector, $L e_j$, is in turn completely determined by its 
expansion coefficients $A_{ij}$, $i = 1, \ldots, N$; 

$$L e_j = \sum_{i=1}^{N} A_{ij} e_i. \tag{B.36}$$

The coefficients $(A_{ij})$ form the matrix A of L with respect to the basis $\{e_1, \ldots, e_N\}$. 
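
The construction of the matrix from the images of the basis vectors translates directly into code. This sketch works in the canonical basis of R^n; the map `shear_scale` is a hypothetical example chosen for illustration.

```python
def matrix_of(L, n):
    """Matrix A of a linear map L on R^n with respect to the canonical basis.

    Column j holds the expansion coefficients of L e_j, so A[i][j] = [L e_j]_i,
    cf. (B.36).
    """
    basis = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    images = [L(e) for e in basis]                 # images[j] = L e_j
    return [[images[j][i] for j in range(n)] for i in range(n)]

def apply(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# A hypothetical linear map on R^2: (x1, x2) -> (x1 + 2*x2, 3*x2).
shear_scale = lambda x: [x[0] + 2.0 * x[1], 3.0 * x[1]]
A = matrix_of(shear_scale, 2)
```

Applying `A` to any vector reproduces the map itself, which is the content of (B.35): linearity extends the values on the basis to all of H.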


We often think of linear maps as matrices in the first place, and use the same 
symbol to denote them. The unit (or identity) matrix is denoted by 1. Occasionally, 
we also use the symbol 1 as the identity map on arbitrary sets (rather than vector 
spaces). 

In this book, we assume elementary knowledge of matrix algebra, including the 
matrix product, corresponding to the composition of two linear maps, 

$$(AB)_{ij} = \sum_{n=1}^{N} A_{in} B_{nj}, \tag{B.37}$$

and the transpose $(A^\top)_{ij} := A_{ji}$. 

The inverse of a matrix A is written $A^{-1}$ and satisfies $A A^{-1} = A^{-1} A = 1$. The 
pseudo-inverse $A^\dagger$ satisfies $A A^\dagger A = A$. While every matrix has a pseudo-inverse, 
not all have an inverse. Those which do are called invertible or nonsingular, and 
their inverse coincides with the pseudo-inverse. Sometimes, we simply use the 
notation $A^{-1}$, and it is understood that we mean the pseudo-inverse whenever A 
is not invertible. 
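
For a concrete non-invertible case, consider a rank-one matrix A = u vᵀ. A standard closed form for its pseudo-inverse is v uᵀ / (‖u‖² ‖v‖²); the sketch below uses this formula (which is not given in the text) and verifies the defining property numerically.

```python
def matmul(A, B):
    """Matrix product (AB)_ij = sum_n A_in B_nj."""
    return [[sum(A[i][n] * B[n][j] for n in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def outer(u, v):
    """Rank-one matrix u v^T."""
    return [[ui * vj for vj in v] for ui in u]

def pinv_rank_one(u, v):
    """Pseudo-inverse of A = u v^T, namely v u^T / (|u|^2 |v|^2)."""
    scale = sum(ui * ui for ui in u) * sum(vj * vj for vj in v)
    return [[vj * ui / scale for ui in u] for vj in v]

u, v = [1.0, 2.0], [2.0, 0.0, 1.0]
A = outer(u, v)                 # a 2 x 3 matrix of rank one: not invertible
A_pinv = pinv_rank_one(u, v)    # a 3 x 2 matrix
check = matmul(matmul(A, A_pinv), A)   # should reproduce A
```

The matrix `A` is not even square, so it has no inverse, yet `check` equals `A` up to rounding.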


B.2.2 Norms and Dot Products 


Thus far, we have explained the linear structure of spaces such as the feature space 
induced by a kernel. We now move on to the metric structure. To this end, we 
introduce concepts of length and angles. 


Definition B.5 (Norm) A function $\|\cdot\| : H \to \mathbb{R}_0^+$ that for all $x, x' \in H$ and $\lambda \in \mathbb{R}$ 
satisfies 

$$\|x + x'\| \le \|x\| + \|x'\|, \tag{B.38}$$
$$\|\lambda x\| = |\lambda| \, \|x\|, \tag{B.39}$$
$$\|x\| > 0 \text{ if } x \neq 0, \tag{B.40}$$

is called a norm on H. If we replace the “>” in (B.40) by “≥,” we are left with what is 
called a semi-norm. 


Any norm defines a metric d via 
d(x,x’) := ||x — x’||; (B.41) 


likewise, any semi-norm defines a semi-metric. The (semi-)metric inherits certain 
properties from the (semi-)norm, in particular the triangle inequality (B.38) and 
positivity (B.40). 

While every norm gives rise to a metric, the converse is not the case. In this 
sense, the concept of the norm is stronger. Similarly, every dot product (to be 
introduced next) gives rise to a norm, but not vice versa. 

Before describing the dot product, we start with a more general concept. 


Definition B.6 (Bilinear Form) A bilinear form on a vector space H is a function 

$$Q : H \times H \to \mathbb{R}, \quad (x, x') \mapsto Q(x, x') \tag{B.42}$$

with the property that for all $x, x', x'' \in H$ and all $\lambda, \lambda' \in \mathbb{R}$, we have 

$$Q((\lambda x + \lambda' x'), x'') = \lambda Q(x, x'') + \lambda' Q(x', x''), \tag{B.43}$$
$$Q(x'', (\lambda x + \lambda' x')) = \lambda Q(x'', x) + \lambda' Q(x'', x'). \tag{B.44}$$

If the bilinear form also satisfies 

$$Q(x, x') = Q(x', x) \tag{B.45}$$

for all $x, x' \in H$, it is called symmetric. 


Definition B.7 (Dot Product) A dot product on a vector space H is a symmetric 
bilinear form, 

$$\langle \cdot, \cdot \rangle : H \times H \to \mathbb{R}, \quad (x, x') \mapsto \langle x, x' \rangle, \tag{B.46}$$

that is strictly positive definite; in other words, it has the property that for all $x \in H$, 

$$\langle x, x \rangle \ge 0 \text{ with equality only for } x = 0. \tag{B.47}$$


Definition B.8 (Normed Space and Dot Product Space) A normed space is a vec- 
tor space endowed with a norm; a dot product space (sometimes called pre-Hilbert 
space) is a vector space endowed with a dot product. 


Any dot product defines a corresponding norm via 

$$\|x\| := \sqrt{\langle x, x \rangle}. \tag{B.48}$$

We now describe the Cauchy-Schwarz inequality: For all $x, x' \in H$, 

$$|\langle x, x' \rangle| \le \|x\| \, \|x'\|, \tag{B.49}$$

with equality occurring only if x and x' are linearly dependent. In some instances, 
the left hand side can be much smaller than the right hand side. An extreme case 
is when x and x' are orthogonal, and $\langle x, x' \rangle = 0$. 
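
The three regimes of the Cauchy-Schwarz inequality (equality, strict inequality, and a vanishing left hand side) can be seen on small vectors. The vectors below are arbitrary examples chosen for this sketch, using the canonical dot product on R^3.

```python
import math

def dot(x, xp):
    """Canonical dot product on R^N."""
    return sum(a * b for a, b in zip(x, xp))

def norm(x):
    """Norm induced by the dot product, cf. (B.48)."""
    return math.sqrt(dot(x, x))

def cs_gap(a, b):
    """Right hand side minus left hand side of the Cauchy-Schwarz inequality."""
    return norm(a) * norm(b) - abs(dot(a, b))

x = [1.0, 2.0, 2.0]
parallel = [2.0, 4.0, 4.0]       # linearly dependent on x: equality in (B.49)
orthogonal = [2.0, -1.0, 0.0]    # orthogonal to x: left hand side is zero
generic = [1.0, 0.0, 3.0]        # neither: strict inequality
```

Evaluating `cs_gap` on these pairs gives zero for the parallel vector and a positive number otherwise.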

One of the most useful constructions possible in dot product spaces is the 
orthonormal basis expansion. Suppose $e_1, \ldots, e_N$, where $N \in \mathbb{N}$, form an orthonormal set; that 
is, they are mutually orthogonal and have norm 1. If they also form a basis of H, 
they are called an orthonormal basis (ONB). In this case, any $x \in H$ can be written 
as a linear combination, 

$$x = \sum_{j=1}^{N} \langle x, e_j \rangle e_j. \tag{B.50}$$

The standard example of a dot product space is again $\mathbb{R}^N$. We usually employ 
the canonical dot product, 

$$\langle x, x' \rangle := \sum_{i=1}^{N} [x]_i [x']_i = x^\top x', \tag{B.51}$$

and refer to $\mathbb{R}^N$ as the Euclidean space of dimension N. Using this dot product and 
the canonical basis of $\mathbb{R}^N$, each coefficient $\langle x, e_j \rangle$ in (B.50) just picks out one entry 
from the column vector x, thus $x = \sum_{j=1}^{N} [x]_j e_j$. 
A rather useful result concerning norms arising from dot products is the 
Pythagorean Theorem. In its general form, it reads as follows: 


Theorem B.9 (Pythagoras) If $e_1, \ldots, e_q$ are orthonormal (they need not form a basis), 
then 

$$\|x\|^2 = \sum_{i=1}^{q} \langle x, e_i \rangle^2 + \Big\| x - \sum_{i=1}^{q} \langle x, e_i \rangle e_i \Big\|^2. \tag{B.52}$$
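
A quick numerical check of the theorem, as a sketch: the two orthonormal vectors below span only a plane in R^3, so they do not form a basis, and the residual term is nonzero. The vectors and the helper name `pythagoras_sides` are chosen for this illustration.

```python
def dot(x, xp):
    """Canonical dot product on R^N."""
    return sum(a * b for a, b in zip(x, xp))

def pythagoras_sides(x, es):
    """Return both sides of (B.52) for a vector x and orthonormal vectors es."""
    coeffs = [dot(x, e) for e in es]
    proj = [sum(c * e[k] for c, e in zip(coeffs, es)) for k in range(len(x))]
    resid = [a - b for a, b in zip(x, proj)]
    return dot(x, x), sum(c * c for c in coeffs) + dot(resid, resid)

# Two orthonormal vectors in R^3 that do not form a basis (assumed example).
s = 2 ** -0.5
es = [[s, s, 0.0], [s, -s, 0.0]]
lhs, rhs = pythagoras_sides([3.0, 1.0, 2.0], es)
```

Both sides evaluate to the squared length of the vector; the residual accounts for the component along the third axis, which the two orthonormal vectors cannot represent.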
i=l i=l 


Now that we have a dot product, we are in a position to summarize a number of 
useful facts about matrices. 


• It can readily be verified that for the canonical dot product, we have 

$$\langle x, A x' \rangle = \langle A^\top x, x' \rangle \tag{B.53}$$

for all $x, x' \in H$. 


• Matrices A such that $A = A^\top$ are called symmetric. Due to (B.53), they can 
be swapped between the two arguments of the canonical dot product without 
changing its value. 


• Symmetric matrices A that satisfy 

$$\langle x, A x \rangle \ge 0 \tag{B.54}$$

for all $x \in H$ are called positive definite (cf. Remark 2.16 for a note on this 
terminology). 


• Another interesting class of matrices are the unitary (or orthogonal) matrices. A 
unitary matrix U is characterized by an inverse $U^{-1}$ that equals its transpose $U^\top$. 
Unitary matrices thus satisfy 

$$\langle U x, U x' \rangle = \langle U^\top U x, x' \rangle = \langle U^{-1} U x, x' \rangle = \langle x, x' \rangle \tag{B.55}$$

for all $x, x' \in H$; in other words, they leave the canonical dot product invariant. 


• A final aspect of matrix theory of interest in machine learning is matrix 
diagonalization. Suppose A is a linear operator. If there exists a basis $v_1, \ldots, v_N$ of H 
such that for all $i = 1, \ldots, N$, 

$$A v_i = \lambda_i v_i, \tag{B.56}$$

with $\lambda_i \in \mathbb{R}$, then A can be diagonalized: written in the basis $v_1, \ldots, v_N$, we have 
$A_{ij} = 0$ for all $i \neq j$ and $A_{ii} = \lambda_i$ for all i. The coefficients $\lambda_i$ are called eigenvalues, 
and the $v_i$ eigenvectors, of A. 

Let us now consider the special case of symmetric matrices. These can always 
be diagonalized, and their eigenvectors can be chosen to form an orthonormal 
basis with respect to the canonical dot product. If we form a matrix V with these 
eigenvectors as columns, then we obtain the diagonal matrix as $V^\top A V$. 

Rayleigh’s principle states that the smallest eigenvalue $\lambda_{\min}$ coincides with the 
minimum of 

$$R(v) := \frac{\langle v, A v \rangle}{\langle v, v \rangle}. \tag{B.57}$$

The minimizer of R is an eigenvector with eigenvalue $\lambda_{\min}$. Likewise, the largest 
eigenvalue and its corresponding eigenvector can be found by maximizing R. 

Functions $f : I \to \mathbb{R}$, where $I \subset \mathbb{R}$, can be defined on symmetric matrices A with 
eigenvalues in I. To this end, we diagonalize A and apply f to all diagonal 
elements (the eigenvalues). 

Since a symmetric matrix is positive definite if and only if all its eigenvalues are 
nonnegative, we may choose $f(x) = \sqrt{x}$ to obtain the unique square root $\sqrt{A}$ of a 
positive definite matrix A. 
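
For 2×2 symmetric matrices, diagonalization has a closed form, which makes Rayleigh's principle and the matrix square root easy to try out. The sketch below is restricted to that case; the function names and the example matrix are assumptions made for illustration.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (ascending) and orthonormal eigenvectors of [[a, b], [b, d]]."""
    mean, half_diff = 0.5 * (a + d), 0.5 * (a - d)
    r = math.hypot(half_diff, b)
    theta = 0.5 * math.atan2(b, half_diff)     # rotation angle that diagonalizes
    v_max = (math.cos(theta), math.sin(theta))
    v_min = (-math.sin(theta), math.cos(theta))
    return (mean - r, mean + r), (v_min, v_max)

def rayleigh(A, v):
    """R(v) = <v, Av> / <v, v>, cf. (B.57)."""
    Av = [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]
    return (v[0] * Av[0] + v[1] * Av[1]) / (v[0] ** 2 + v[1] ** 2)

def sqrt_sym_2x2(a, b, d):
    """Square root of a positive definite 2x2 matrix: apply sqrt to the eigenvalues."""
    lams, vecs = eig_sym_2x2(a, b, d)
    return [[sum(math.sqrt(l) * v[i] * v[j] for l, v in zip(lams, vecs))
             for j in range(2)] for i in range(2)]

A = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 1 and 3
(lam_min, lam_max), (v_min, v_max) = eig_sym_2x2(2.0, 1.0, 2.0)
S = sqrt_sym_2x2(2.0, 1.0, 2.0)       # S times S should reproduce A
```

The Rayleigh quotient of any nonzero vector lies between `lam_min` and `lam_max`, and it attains these values at `v_min` and `v_max` respectively.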


Many statements about matrices generalize in some form to operators on spaces 
of arbitrary dimension; for instance, Mercer’s theorem (Theorem 2.10) can be 
viewed as a generalized version of a matrix diagonalization, with eigenvectors 
(or eigenfunctions) $\psi_j$ satisfying $\int_X k(x, x')\, \psi_j(x')\, d\mu(x') = \lambda_j \psi_j(x)$. 


B.3 Functional Analysis 




Functional analysis combines concepts from linear algebra and analysis. Conse- 
quently, it is also concerned with questions of convergence and continuity. For a 
detailed treatment, cf. [429, 306, 112]. 


Definition B.10 (Cauchy Sequence) A sequence $(x_i)_i := (x_i)_{i \in \mathbb{N}} = (x_1, x_2, \ldots)$ in a 
normed space H is said to be a Cauchy sequence if for every $\epsilon > 0$, there exists an $n \in \mathbb{N}$ 
such that for all $n', n'' \ge n$, we have $\|x_{n'} - x_{n''}\| < \epsilon$. 

A Cauchy sequence is said to converge to a point $x \in H$ if $\|x_n - x\| \to 0$ as $n \to \infty$. 


Definition B.11 (Completeness, Banach Space, Hilbert Space) A space H is called 
complete if all Cauchy sequences in the space converge. 

A Banach space is a complete normed space; a Hilbert space is a complete dot product 
space. 


The simplest example of a Hilbert space (and thus also of a Banach space) is 
again $\mathbb{R}^N$. More interesting Hilbert spaces, however, have infinite dimensionality. 
A number of surprising things can happen in this case. To prevent the nasty ones, 
we generally assume that the Hilbert spaces we deal with are separable,¹³ which 
means that there exists a countable dense subset. A dense subset is a set S such that 
each element of H is the limit of a sequence in S. Equivalently, the completion of 


13. One of the positive side effects of this is that we essentially only have to deal with one 
Hilbert space: all separable Hilbert spaces are equivalent, in a sense that we won’t define 
presently. 
S equals H. Here, the completion $\bar{S}$ is obtained by adding all limit points of Cauchy 
sequences to the set.¹⁴ 


Example B.12 (Hilbert Space of Functions) Let C[a, b] denote the real-valued 
continuous functions on the interval [a, b]. For $f, g \in C[a, b]$, 

$$\langle f, g \rangle := \int_a^b f(x)\, g(x)\, dx \tag{B.58}$$

defines a dot product. The completion of C[a, b] in the corresponding norm is the Hilbert 
space $L_2[a, b]$ of measurable functions¹⁵ that are square integrable; 

$$\int_a^b f(x)^2\, dx < \infty. \tag{B.59}$$

This notion can be generalized to $L_2(\mathbb{R}^N, P)$. Here, P is a Borel measure on $\mathbb{R}^N$, and the 
dot product is given by 

$$\langle f, g \rangle = \int_{\mathbb{R}^N} f(x)\, g(x)\, dP(x). \tag{B.60}$$


One of the most useful properties of Hilbert spaces is that as in the case of finite- 
dimensional vector spaces, it is possible to compute projections. Before stating the 
theorem, recall that a subset M of H is called closed if every convergent sequence 
in H with elements that lie in M also has its limit in M. Any closed subspace of a 
Hilbert space is itself a Hilbert space. 


Theorem B.13 (Projections in Hilbert Spaces) Let H be a Hilbert space and M be a 
closed subspace. Then every $x \in H$ can be written uniquely as $x = z + z^\perp$, where $z \in M$ 
and $\langle z^\perp, t \rangle = 0$ for all $t \in M$. The vector z is the unique element of M minimizing $\|x - z\|$; 
it is called the projection $Px := z$ of x onto M. The projection operator P is a linear map. 


Another feature of Hilbert spaces is that they come with a useful generaliza- 
tion of the concept of a basis. Recall that a basis is a set of vectors that allows 
us to uniquely write each x as a linear combination. In the context of infinite- 
dimensional Hilbert spaces, this is quite restrictive (note that linear combinations 
(B.28) always involve finitely many terms) and leads to bases that are not count- 
able. Therefore, we usually work with what is called a complete orthonormal system 
or an orthonormal basis (ONB).¹⁶ Formally, this is defined as an orthonormal set S 
in a Hilbert space H with the property that no other nonzero vector in H is orthog- 
onal to all elements of S. 


14. Note that the completion is denoted by the same symbol as the set complement. Math- 
ematics is full of this kind of symbol overloading, which adds to the challenge. 

15. These are not strictly speaking individual functions, but equivalence classes of func- 
tions that are allowed to differ on sets of measure zero. 

16. These systems are often referred to as bases in the context of Hilbert spaces. This is 
slightly misleading, since they are not bases in the vector space sense. 
Separable Hilbert spaces possess countable ONBs, which can be constructed 
using the Gram-Schmidt procedure. Suppose $\{v_i\}_{i \in A}$ is a linearly independent set 
of vectors with a span dense in H, with A being a countable index set. A countable 
ONB $e_1, e_2, \ldots$ can then be constructed as follows: 

$$\begin{aligned} e_1 &:= v_1 / \|v_1\|, \\ e_2 &:= (v_2 - P_1 v_2) / \|v_2 - P_1 v_2\|, \\ e_3 &:= (v_3 - P_2 v_3) / \|v_3 - P_2 v_3\|, \\ &\;\;\vdots \end{aligned} \tag{B.61}$$

Here, we use the shorthand $P_n$ for the operator 

$$P_n x := \sum_{i=1}^{n} \langle e_i, x \rangle e_i. \tag{B.62}$$

It is easy to show that $P_n$ projects onto the subspace spanned by $e_1, \ldots, e_n$. 

If the $v_1, v_2, \ldots$ are not linearly independent, then it is possible that 
$v_{n+1} - P_n v_{n+1} = 0$, which means $v_{n+1}$ can be expressed as a linear combination of 
$v_1, \ldots, v_n$. In this case, we simply leave out $v_{n+1}$ and proceed with $v_{n+2}$, shifting 
all subsequent indices by 1. 
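
The procedure, including the rule for skipping linearly dependent vectors, can be sketched for finite lists of vectors in R^N as follows. The tolerance `tol` is an assumption of this sketch: in floating point arithmetic, the exact test for a zero residual has to be replaced by a small threshold.

```python
import math

def dot(x, y):
    """Canonical dot product on R^N."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vs, tol=1e-12):
    """Orthonormalize vs following (B.61); linearly dependent vectors are skipped."""
    es = []
    for v in vs:
        # Subtract P_n v = sum_i <e_i, v> e_i, the projection onto span(e_1..e_n).
        proj = [sum(dot(e, v) * e[k] for e in es) for k in range(len(v))]
        resid = [a - b for a, b in zip(v, proj)]
        nrm = math.sqrt(dot(resid, resid))
        if nrm > tol:                      # otherwise v depends on earlier vectors
            es.append([r / nrm for r in resid])
    return es

# The second vector is a multiple of the first and is therefore left out.
vs = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
es = gram_schmidt(vs)
```

This is the classical variant of the procedure; numerically more robust variants exist, but the classical one mirrors (B.61) most directly.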

Using an ONB, we can give basis expansions in infinite-dimensional Hilbert 
spaces, which look just like basis expansions in the finite-dimensional case. For 
separable Hilbert spaces, the index set A is countable. 


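Theorem B.14 (ONB Expansions & Parseval’s Relation) Let $\{e_i\}_{i \in A}$ be an ONB of 
the Hilbert space H. Then for each $x \in H$, 

$$x = \sum_{i \in A} \langle e_i, x \rangle e_i \tag{B.63}$$

and 

$$\|x\|^2 = \sum_{i \in A} \langle e_i, x \rangle^2. \tag{B.64}$$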


Note that this generalizes the Pythagorean Theorem to the infinite-dimensional 
case. 

Let us describe an application of this result, with the dual purpose of demon- 
strating a standard trick from functional analysis, and mathematically justifying a 
crucial step in the “kernelization” of many algorithms. In Kernel PCA, we need to 
solve an eigenvalue problem of the form (cf. (14.7)) 


$$\lambda v = C v, \tag{B.65}$$

and we know a priori that all solutions v lie in the span of $x_1, \ldots, x_m \in H$. In 
Chapter 14, we argued that this means we may instead consider the set of equations 

$$\langle x_n, v_1 \rangle = \langle x_n, v_2 \rangle \text{ for all } n = 1, \ldots, m, \tag{B.66}$$

where we use the shorthand $v_1 = \lambda v$ and $v_2 = C v$. 
We are now in a position to prove this formally. It suffices to consider the case 
where the $\{x_1, \ldots, x_m\}$ are orthonormal. If they are not, we first apply the 
Gram-Schmidt procedure to construct an orthonormal set $\{e_1, \ldots, e_n\}$. The latter is a 
basis for the span of the $x_i$, hence each $x_i$ can be written as a linear combination of 
the $e_i$. Conversely, each $e_i$ can be written as a linear combination of the $x_i$ by 
construction (cf. (B.61)). Therefore, (B.66) is actually equivalent to the corresponding 
statement for the orthonormal set $\{e_1, \ldots, e_n\}$. 

In the orthonormal case, the Parseval relation (B.64), applied to the completion 
of the span of the $x_i$ (which is a Hilbert space), implies that (we replace $x = v_1 - v_2$) 

$$\|v_1 - v_2\|^2 = \sum_{i=1}^{m} \left( \langle x_i, v_1 \rangle - \langle x_i, v_2 \rangle \right)^2. \tag{B.67}$$

Therefore $v_1 = v_2$ if and only if (14.8) holds true. In a nutshell, the ONB expansion 
coefficients are unique and completely characterize each vector in a Hilbert space. 

We next revisit the $L_2$ space. Since we will be using the complex exponential, we 
consider for a moment the case where H is a Hilbert space over $\mathbb{C}$ rather than $\mathbb{R}$. 


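Example B.15 (Fourier Series) The collection of functions¹⁷ 

$$\left\{ \frac{1}{\sqrt{2\pi}}, \frac{e^{ix}}{\sqrt{2\pi}}, \frac{e^{-ix}}{\sqrt{2\pi}}, \frac{e^{2ix}}{\sqrt{2\pi}}, \frac{e^{-2ix}}{\sqrt{2\pi}}, \ldots \right\} \tag{B.68}$$

is an ONB for $L_2[0, 2\pi]$. As a consequence of Theorem B.14, we can thus expand any 
$f \in L_2[0, 2\pi]$ as 

$$f(x) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} c_n e^{inx}, \tag{B.69}$$

where the Fourier coefficients $c_n$ are given by¹⁸ 

$$c_n = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} e^{-inx} f(x)\, dx. \tag{B.70}$$
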
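
The Fourier coefficients can be approximated numerically and used to reconstruct a function. The sketch below replaces the integral by a Riemann sum on a uniform grid (an assumption of this example; the grid size `samples` is arbitrary) and uses a trigonometric polynomial as a made-up test function.

```python
import cmath
import math

def fourier_coefficient(f, n, samples=2048):
    """Approximate c_n = (1/sqrt(2 pi)) * integral_0^{2 pi} exp(-i n x) f(x) dx.

    The integral is replaced by a Riemann sum on a uniform grid.
    """
    h = 2.0 * math.pi / samples
    total = sum(cmath.exp(-1j * n * k * h) * f(k * h) for k in range(samples))
    return total * h / math.sqrt(2.0 * math.pi)

def partial_sum(coeffs, x):
    """(1/sqrt(2 pi)) * sum_n c_n exp(i n x) over the stored indices n."""
    return sum(c * cmath.exp(1j * n * x) for n, c in coeffs.items()) / math.sqrt(2.0 * math.pi)

# A made-up test function with only three nonzero Fourier coefficients.
f = lambda x: 1.0 + 2.0 * math.cos(3.0 * x)
coeffs = {n: fourier_coefficient(f, n) for n in range(-4, 5)}
```

For this test function only the coefficients with n = 0 and n = ±3 are nonzero, so the partial sum over n = −4, …, 4 already reproduces f up to rounding.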
B.3.1 Advanced Topics 


We now move on to concepts that are only used in a few of the chapters; these 
mainly comprise results that build on [606]. We define normed spaces $\ell_p^N$ as 
follows: As vector spaces, they are identical to $\mathbb{R}^N$, but are endowed in addition with 
p-norms. For $1 \le p < \infty$, these p-norms are defined as 

$$\|x\|_{\ell_p^N} := \|x\|_p = \Big( \sum_{j=1}^{N} |x_j|^p \Big)^{1/p}; \tag{B.71}$$


17. Here i is the imaginary unit $\sqrt{-1}$. 

18. Comparing this to (B.60), we note that there is an unexpected minus sign in $e^{-inx}$. 
This is due to the fact that in the complex case, the dot product (B.60) includes a complex 
conjugation in the first argument. 
for $p = \infty$, as 

$$\|x\|_{\ell_\infty^N} := \|x\|_\infty = \max_{j=1,\ldots,N} |x_j|. \tag{B.72}$$

We use the shorthand $\ell_p$ to denote the case where $N = \infty$. In this case, it is 
understood that $\ell_p$ contains all sequences with finite p-norm. For $N = \infty$, the max 
in (B.72) is replaced by a sup. 

Often the above notations are also used in the case where 0 < p < 1. In that case, 
however, we are no longer dealing with norms. 
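
The following sketch computes p-norms for finite vectors and shows, on one pair of vectors, how the triangle inequality fails for p < 1. The function names and test vectors are chosen for this illustration.

```python
import math

def p_norm(x, p):
    """||x||_p for 0 < p <= infinity; this is only a norm when p >= 1."""
    if math.isinf(p):
        return max(abs(xj) for xj in x)
    return sum(abs(xj) ** p for xj in x) ** (1.0 / p)

e1, e2 = [1.0, 0.0], [0.0, 1.0]
both = [1.0, 1.0]                     # e1 + e2

def triangle_holds(p):
    """Check ||e1 + e2||_p <= ||e1||_p + ||e2||_p for this one pair of vectors."""
    return p_norm(both, p) <= p_norm(e1, p) + p_norm(e2, p) + 1e-12
```

For p = 1/2 the left hand side is (1 + 1)² = 4 while the right hand side is 2, so the triangle inequality is violated; for p ≥ 1 the check passes.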

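Suppose F is a class of functions $f : X \to \mathbb{R}$. The $\ell_\infty^X$ norm of $f \in F$ with respect to 
a sample $X = (x_1, \ldots, x_m)$ is defined as 

$$\|f\|_{\ell_\infty^X} := \max_{i=1,\ldots,m} |f(x_i)|. \tag{B.73}$$

Likewise, 

$$\|f\|_{\ell_p^X} = \|(f(x_1), \ldots, f(x_m))\|_{\ell_p^m}. \tag{B.74}$$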

Given some set X with a σ-algebra, a measure μ on X, some p in the range 
$1 \le p < \infty$, and a function $f : X \to \mathbb{R}$, we define 

$$\|f\|_{L_p(X)} := \|f\|_p = \Big( \int_X |f(x)|^p\, d\mu(x) \Big)^{1/p} \tag{B.75}$$

if the integral exists, and 

$$\|f\|_{L_\infty(X)} := \|f\|_\infty = \operatorname*{ess\,sup}_{x \in X} |f(x)|. \tag{B.76}$$

Here, ess sup denotes the essential supremum; that is, the smallest number that 
upper bounds |f(x)| almost everywhere. 

For $1 \le p \le \infty$, we define 

$$L_p(X) := \{ f : X \to \mathbb{R} \mid \|f\|_{L_p(X)} < \infty \}. \tag{B.77}$$


Here, we have glossed over some details: in fact, these spaces do not consist of 
functions, but of equivalence classes of functions differing on sets of measure 
zero. An interesting exception to this rule are reproducing kernel Hilbert spaces 
(Section 2.2.3). For these, we know that point evaluation of all functions in the 
space is well-defined: it is determined by the reproducing kernel, see (2.29). 

Let $\mathfrak{L}(E, G)$ be the set of all bounded linear operators T between the normed 
spaces $(E, \|\cdot\|_E)$ and $(G, \|\cdot\|_G)$; in other words, operators such that the image of 
the (closed) unit ball, 

$$U_E := \{ x \in E \mid \|x\|_E \le 1 \}, \tag{B.78}$$

is bounded. The smallest such bound is called the operator norm, 

$$\|T\| := \sup_{x \in U_E} \|Tx\|_G. \tag{B.79}$$
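
In finite dimensions the operator norm can sometimes be computed by direct search. The sketch below treats a matrix as an operator between spaces equipped with the maximum norm; this choice of norms, the example matrix, and the function names are assumptions made for this illustration. It relies on the fact that a convex function on a cube attains its supremum at a vertex, and compares the result with the known closed form, the maximum absolute row sum.

```python
from itertools import product

def apply(T, x):
    """Matrix-vector product T x."""
    return [sum(t * xj for t, xj in zip(row, x)) for row in T]

def sup_norm(x):
    """Maximum norm of a finite vector."""
    return max(abs(xj) for xj in x)

def operator_norm_inf(T):
    """Operator norm of T with the maximum norm on both domain and range.

    x -> ||Tx||_inf is convex, so its supremum over the unit ball [-1, 1]^N is
    attained at one of the 2^N vertices; this is only feasible for small N.
    """
    n = len(T[0])
    return max(sup_norm(apply(T, v)) for v in product((-1.0, 1.0), repeat=n))

def max_abs_row_sum(T):
    """Known closed form for the same quantity."""
    return max(sum(abs(t) for t in row) for row in T)

T = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
```

For this matrix both computations give 4, the absolute row sum of the second row.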
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example, 578 
expectation, 580 
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extreme point, 153, 447 


fat shattering dimension, 392 
feasible point, 166 
feature, 428 
extraction, 431, 443 
map, 32, 93 
continuity, 41 
empirical kernel, 42 
GNS, 425 
kernel PCA, 43 
Mercer, 36 
normalized, 413 
pairwise, 44 
reproducing kernel, 32 
monomial, 26 
polynomial, 26 
product, 26 
space, 3, 39, 429 
equivalence, 39 
infinite-dimensional, 47 
RKHS, 35 
Fisher 
information, 72 
matrix, 419 
score, 419 
Fletcher-Reeves method, 163 


Gamma 
distribution, 506 
function, 107 
Gaussian 
approximation, 478 
kernel, 45 
process, 481 
Generalized Portrait, 11 
generative model, 529 
generative topographic map, 530 
Gibbs classifier, 381 
global minimum, 152 
gradient descent, 157, 315 
conjugate, 160 
Gram-Schmidt orthonormalization, 
588 
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graphical model, 418 
greedy selection, 291 
growth function, 9, 140, 393 


Heaviside function, 303 
hidden Markov model, 418 
Hilbert space, 35, 586 
reproducing kernel, 36 
separable, 586 
hit rate, 286 
Hough transform, 225 
Huber’s loss, 76 
hyperparameter, 475 
hyperplane, 4 
canonical, 12 
optimal, 11 
separating, 11, 168, 189 
soft margin, 16 
supporting, 237 


iid sample, 577 

implementation, 279 

induction principle, 127 

input, 1 

instance, 1 

integral operator, 30 

interior point, 295 
method, 175 

interval cutting, 155 

invariance 
translation, 46 
unitary, 46 


Karush-Kuhn-Tucker conditions, 13, 
166 

differentiable, 170 
necessary, 169 

kernel, 2, 30 
B-spline, 46 
R-convolution, 411 
admissible, 30 
alignment, 57 
ANOVA, 272, 411 
Bayes, 57 
codon-improved, 417 


Index 


conditionally positive definite, 
49,53, 118 

conformal transformation, 408 

direct sum, 411 

DNA, 417 

examples, 45 

feature analysis, 443 

Fisher, 418 

for structured objects, 411 

Gaussian, 45, 46, 402, 411 

Hilbert space representation, 29 

infinitely divisible, 53 

inhomogeneous polynomial, 46 

jittered, 354 

Laplacian, 402 

locality-improved, 417 

locally linear embedding, 436 

map, see feature map 

Mercer, 30 

monomial, 27 

natural, 418, 419 

nonnegative definite, 30 

on a o-algebra, 47 

on a group, 424 

optimal, 423 

PCA, 41, 92, 588 

pd, 31 

polynomial, 27, 45, 56 

positive definite, 30, 34 

properties, 45 

RBF, 46 

regularization interpretation, 94 

reproducing, 30, 33 

scaling, 216 

set, 47 

sigmoid, 45 

sparse vector, 412 

spline, 98 

strictly positive definite, 31 

symmetric, 2 

tanh, 45 

tensor product, 410 

trick, 15, 34, 195, 201, 429 


KKT, see Karush-Kuhn-Tucker condi- 
tions 

KKT gap, 170, 282 

Kronecker delta, 582 

Kuhn-Tucker conditions, see Karush- 
Kuhn-Tucker conditions 


label, 1 
Lagrange multiplier, 13 
Lagrangian, 13, 166 
SVM, 318 
pseudocode, 320 
Laplace approximation, 488 
Laplacian Prior, 501 
Laplacian process, 499 
learning 
from examples, 1 
machine, 8 
online, 320 
leave-one-out, 250 
machine, 370 
mean field approximation, 377 
Lie group, 337 
likelihood, 69 
linear 
combination, 581 
independence, 581 
map, 582 
Lipschitz continuous, 534 
LLE, see locally linear embedding 
locally linear embedding, 436 
log-likelihood, 419 
logistic regression, 63, 471 
loss function, 18, 19, 62, 394 
ε-insensitive, 18, 251
ε-tube, 251
hinge, 324 
zero-one, 8 
LP machine, 120, 214 
luckiness, 384 


MAP, see maximum a posteriori estimate
map 
injective, 181
surjective, 577
margin, 142, 192
ε-, 253
and flatness, 253
coding interpretation, 194
computational considerations, 204
of canonical hyperplane, 192
VC bound, 194
vs. training with noise, 192
matrix, 582
adjoint, 46
condition of a, 68, 159
conditionally positive definite, 49
ean 585
Gram, 30, 430
inverse, 583
inversion lemma, see Sherman-Woodbury-Morrison formula
kernel, 30
natural
nonsingular, 583
orthogonal, 585
positive definite, 30, 585
product, 582
pseudo-inverse, 583
strictly positive definite, 31
symmetric, 585
tangent covariance, 346, 350
transposed, 583
unitary, 585
Maurey’s Theorem, 398
maximum a posteriori estimate, 476
maximum likelihood, 69
measurable
function, 579
set, 577
measure, 577
empirical, 579
metric, 583
semi, 583
Minimum Description Length, 194
multidimensional scaling, 436


network
neural, 202
RBF, 203
Newton's method, 156
noise
heteroscedastic, 269, 276
input, 192
parameter, 194
pattern, 192
norm, 583
operator, 590
semi, 583
notation table, 625


objective function, 13
observation, 1
OCR, see optical character recognition
offset, see threshold
online learning, 320
operator, 582
bounded, 590
compact, 393
norm, 590
optical character recognition, 22, 211, 440
optimization
constrained, 165
constraint qualification, 167
optimality conditions, 166
problem
dual, 199
infeasible, 173
sequential minimal, 234
orthonormal set, 584
outlier, 236
output, 1
overfitting, 127


parameter optimization, 217
Parzen windows, 6, 233
pattern, 1
pattern recognition, 2
PCA, see principal component analysis
Peano curve, 531 
perceptron, 193 
Polak-Ribiere method, 163 
pre-image 
approximate, 546 
exact, 544 
precompact, 393 
prediction, 288 
predictor corrector method, 163 
principal component analysis, 427, 442
kernel, 20, 431 
linear, 428 
nonlinear, 429, 434 
oriented, 348 
principal curves, 435, 522 
length constraint, 524 
principal manifold, 517 
prior, 472 
data dependent, 500 
improper, 477 
probability, 575 
conditional, 464, 474 
distribution, 576 
measure, 577 
posterior, 464 
space, 575 
programming problem 
dual, 14, 15, 19, 171
linear, 172 
primal, 12, 16, 18 
quadratic, 172 
projection pursuit, 445 
kernel, 445 
proof, see pudding 
pudding, see proof 


quadratic form, 30 

Quantifier Reversal Lemma, 388 

quantile, 80, 450 
multidimensional, 229 


quantization error, 518 
empirical, 519 
expected, 518 


random 
evaluation, 181 
quantity, 576 
subset, 179 
subset selection, 290 
variable, 576 
vector, 576 
rank-1 update, 290 
Rayleigh Coefficient, 457 
reduced KKT system, 176, 296 
reduced set, 258, 544 
Burges method, 561 
construction, 561 
expansion, 553 
method, 22 
selection, 554 
regression, 19 
C-SV, 253 
ν-LP, 268
ν-SV, 260
regularization, 87, 363, 433 
operator, 93 
Fisher, 420 
for polynomial kernels, 110 
for translation invariant kernels, 96
natural, 420 
regularized principal manifold, 517 
algorithm, 526 
regularized quantization functional, 522
Relevance Vector Machine, 258, 506 
replacing the metric, 160 
Representer Theorem, 89 
restart, 285 
risk, 127 
actual, 8 
Bayes, 9 
bound, 138 
empirical, 8, 127 
frequentist, 9
functional, 89, 139 
regularized, 479 
robust estimator, 76 
RPM, see regularized principal manifolds
RS, see reduced set 
RVM, see Relevance Vector Machine 


sample, 7, 577
complexity, 395 
iid, 577 
mean, 519 
score 
function, 72 
map, 419 
semi-norm, 583 
semiparametric models, 115 
Sequential Minimal Optimization, 305
classification, 307 
regression, 308 
selection rule, 311 
SGMA, see sparse greedy matrix approximation
shattering, 9, 136 
coefficient, 136 
Sherman-Woodbury-Morrison, 299, 489
significant figures, 178 
similarity measure, 2 
slack variable, 16, 205 
SMO, see Sequential Minimal Optimization
snooping, 128 
soft margin loss, 63 
space 
Banach, 586 
dot product, 584 
Hilbert, 586 
linear, 581 
normed, 584 
probability, 575 
vector, 581 
version, 225
span, 581 
of an SVM solution, 371 
sparse coding, 500 
sparse decomposition, 443 
sparse greedy 
algorithm, 183 
approximation, 461, 493 
matrix approximation, 288 
sparsity, 120, 460 
SRM, see structural risk minimization 
stability, 361 
statistical manifold, 418 
stochastic process, 580 
stopping criterion, 282 
structural risk minimization, 138 
subset selection, 302 
Support Vector, 6, 14, 197, 202 
bound, 210 
essential, 210 
expansion, 14, 255 
in-bound, 210 
mechanical interpretation, 14 
novelty detection, 227 
pattern recognition, 15 
pattern recognition 
primal reformulation, 559 
quantile estimation, 227 
regression, 17 
regression using ν, 19
set, 21 
single-class-classification, 227 
virtual, 22, 337 
SV, see Support Vector 
symbol list, 625 
symmetrization, 135 


target, 1 

Taylor series expansion, 155 

tensor product, 410 

test error, 66 

text categorization, 221 

threshold, 17, 203, 205, 209, 298, 310 
topological space, 41 
training
example, 8 
Gaussian Processes, 488 
kernel PCA, 429 
KFD, 460 
sparse KFA, 447 
SVM, 279 
transduction, 66 


union bound, 135, 535 
unit ball, 590 
USPS, see dataset 


variable 
dual, 13 
primal, 13 
variance, 580 
VC 
dimension, 9, 141, 391 
real-valued functions, 141 
entropy, 9, 139 
vector quantization, 243, 520 
version space, 225 
virtual example, 335 
virtual SV, 337 


whitening, 347 
working set, 301 


Notation and Symbols 


ℝ            the set of reals
ℕ            the set of natural numbers, ℕ = {1, 2, ...}
𝒳            the input domain
N            (used if 𝒳 is a vector space) dimension of 𝒳
x_i          input patterns
y_i          target values y_i ∈ ℝ, or (in pattern recognition) classes y_i ∈ {±1}
m            number of training examples
[m]          compact notation for {1, ..., m}
i, j         indices, by default running over [m]
X            a sample of input patterns, X = (x_1, ..., x_m)
Y            a sample of output targets, Y = (y_1, ..., y_m)
ℋ            feature space
Φ            feature map, Φ : 𝒳 → ℋ
𝐱_i          a vector with entries [𝐱_i]_j; usually a mapped pattern in ℋ, 𝐱_i = Φ(x_i)
w            weight vector in feature space
b            constant offset (or threshold)
k            (positive definite) kernel
K            kernel matrix or Gram matrix, K_ij = k(x_i, x_j)
E[ξ]         expectation of a random variable ξ (Section B.1.3)
P{·}         probability of a logical formula
P(C)         probability of a set (event) C
p(x)         density evaluated at x ∈ 𝒳
𝒩(ε, ℱ, d)   covering number of a set ℱ in the metric d with precision ε
𝒩(μ, σ)      normal distribution with mean μ and variance σ
ε            parameter of the ε-insensitive loss function
α_i          Lagrange multiplier or expansion coefficient
β_i          Lagrange multiplier
α, β         vectors of Lagrange multipliers
ξ_i          slack variables
ξ            vector of all slack variables
Q            Hessian of a quadratic program


⟨x, x'⟩      dot product between x and x'
‖·‖          2-norm, ‖x‖ := √⟨x, x⟩
‖·‖_p        p-norm, ‖x‖_p := (Σ_{i=1}^N |x_i|^p)^{1/p}, N ∈ ℕ ∪ {∞}
‖·‖_∞        ∞-norm, ‖x‖_∞ := max_i |x_i| on ℝ^N, ‖x‖_∞ := sup_i |x_i| on ℓ_∞
ln           logarithm to base e
log_2        logarithm to base 2
f            a function 𝒳 → ℝ or 𝒳 → {±1}
ℱ            a family of functions
ρ_f(x, y)    margin of function f on the example (x, y), i.e., y · f(x)
ρ_f          margin of f on the training set, i.e., min_{1≤i≤m} ρ_f(x_i, y_i)
h            VC dimension
C            regularization parameter in front of the empirical risk term
λ            regularization parameter in front of the regularizer
x ∈ [a, b]   interval a ≤ x ≤ b
x ∈ (a, b]   interval a < x ≤ b
x ∈ (a, b)   interval a < x < b
A^{-1}       inverse matrix (in some cases, pseudo-inverse)
A^⊤          transposed matrix (or vector)
A^*          adjoint matrix (or: operator, vector), i.e., transposed and complex conjugate
(x_j)        shorthand for a sequence (x_j) = (x_1, x_2, ...)
ℓ_p          sequence spaces, 1 ≤ p ≤ ∞ (Section B.3.1)
L_p(𝒳)       function spaces, 1 ≤ p ≤ ∞ (Section B.3.1)
I_A          characteristic (or indicator) function on a set A, i.e., I_A(x) = 1 if x ∈ A and 0 otherwise
1            unit matrix, or identity map (1(x) = x for all x)
|C|          cardinality of a set C (for finite sets, the number of elements)
Υ            regularization operator
δ_ij         Kronecker δ (Section B.2.1)
δ_x          Dirac δ, satisfying ∫ δ_x(y) f(y) dy = f(x)
O(g(n))      a function f(n) is said to be O(g(n)) if there exists a constant C such that |f(n)| ≤ C g(n) for all n
o(g(n))      a function f(n) is said to be o(g(n)) if there exists a constant c such that |f(n)| ≥ c g(n) for all n
rhs/lhs      shorthand for “right/left hand side”
■            the end of a proof
•            easy problem
••           intermediate problem
•••          difficult problem
◦◦◦          open problem


